{"title": "Conditional Adversarial Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 1640, "page_last": 1650, "abstract": "Adversarial learning has been embedded into deep networks to learn disentangled and transferable representations for domain adaptation. Existing adversarial domain adaptation methods may struggle to align different domains of multimodal distributions that are native in classification problems. In this paper, we present conditional adversarial domain adaptation, a principled framework that conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Conditional domain adversarial networks (CDANs) are designed with two novel conditioning strategies: multilinear conditioning that captures the cross-covariance between feature representations and classifier predictions to improve the discriminability, and entropy conditioning that controls the uncertainty of classifier predictions to guarantee the transferability. Experiments testify that the proposed approach exceeds the state-of-the-art results on five benchmark datasets.", "full_text": "Conditional Adversarial Domain Adaptation\n\nMingsheng Long\u2020, Zhangjie Cao\u2020, Jianmin Wang\u2020, and Michael I. Jordan(cid:93)\n\u2020KLiss, MOE; BNRist; Research Center for Big Data, Tsinghua University, China\n\n\u2020School of Software, Tsinghua University, China\n\n(cid:93)University of California, Berkeley, Berkeley, USA\n\n{mingsheng, jimwang}@tsinghua.edu.cn\n\ncaozhangjie14@gmail.com\n\njordan@berkeley.edu\n\nAbstract\n\nAdversarial learning has been embedded into deep networks to learn disentangled\nand transferable representations for domain adaptation. Existing adversarial do-\nmain adaptation methods may not effectively align different domains of multimodal\ndistributions native in classi\ufb01cation problems. 
In this paper, we present conditional adversarial domain adaptation, a principled framework that conditions the adversarial adaptation models on discriminative information conveyed in the classifier predictions. Conditional domain adversarial networks (CDANs) are designed with two novel conditioning strategies: multilinear conditioning, which captures the cross-covariance between feature representations and classifier predictions to improve discriminability, and entropy conditioning, which controls the uncertainty of classifier predictions to guarantee transferability. With theoretical guarantees and a few lines of code, the approach exceeds state-of-the-art results on five datasets.

1 Introduction

Deep networks have significantly improved the state of the art for diverse machine learning problems and applications. When trained on large-scale datasets, deep networks learn representations that are generically useful across a variety of tasks [36, 11, 54]. However, deep networks can be weak at generalizing learned knowledge to new datasets or environments. Even a subtle change from the training domain can cause deep networks to make spurious predictions on the target domain [54, 36]. In many real applications, there is a need to transfer a deep network from a source domain, where sufficient training data is available, to a target domain, where only unlabeled data is available; such a transfer learning paradigm is hindered by the shift in data distributions across domains [39].

Learning a model that reduces the dataset shift between training and testing distributions is known as domain adaptation [38]. Previous domain adaptation methods in the shallow regime bridge the source and target either by learning invariant feature representations or by estimating instance importance using labeled source data and unlabeled target data [24, 37, 15].
Recent advances in deep domain adaptation leverage deep networks to learn transferable representations by embedding adaptation modules in deep architectures, simultaneously disentangling the explanatory factors of variations behind data and matching feature distributions across domains [12, 13, 29, 52, 31, 30, 51]. Adversarial domain adaptation [12, 52, 51] integrates adversarial learning and domain adaptation in a two-player game similar to Generative Adversarial Networks (GANs) [17]. A domain discriminator is learned by minimizing the classification error of distinguishing the source from the target domain, while a deep classification model learns transferable representations that are indistinguishable by the domain discriminator. On par with these feature-level approaches, generative pixel-level adaptation models perform distribution alignment in raw pixel space, translating source data to the style of the target domain using image-to-image translation techniques [56, 28, 22, 43]. Another line of work aligns the distributions of features and classes separately using different domain discriminators [23, 8, 50].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Despite their general efficacy for various tasks ranging from classification [12, 51, 28] to segmentation [43, 50, 22], these adversarial domain adaptation methods may still be constrained by two bottlenecks. First, when data distributions embody complex multimodal structures, adversarial adaptation methods may fail to capture such structures, precluding a discriminative alignment of distributions without mode mismatch. This risk stems from the equilibrium challenge of adversarial learning: even if the discriminator is fully confused, there is no guarantee that the two distributions are sufficiently similar [3].
Note that this risk cannot be tackled by aligning the distributions of features and classes via separate domain discriminators as in [23, 8, 50], since the multimodal structures can only be captured sufficiently by the cross-covariance dependency between the features and classes [47, 44]. Second, it is risky to condition the domain discriminator on discriminative information when that information is uncertain.

In this paper, we tackle these two challenges by formalizing a conditional adversarial domain adaptation framework. Recent advances in Conditional Generative Adversarial Networks (CGANs) [34, 35] show that the distributions of real and generated images can be made similar by conditioning the generator and discriminator on discriminative information. Motivated by this conditioning insight, this paper presents Conditional Domain Adversarial Networks (CDANs), which exploit discriminative information conveyed in the classifier predictions to assist adversarial adaptation. The key to the CDAN models is a novel conditional domain discriminator conditioned on the cross-covariance of domain-specific feature representations and classifier predictions. We further condition the domain discriminator on the uncertainty of classifier predictions, prioritizing the discriminator on easy-to-transfer examples. The overall system can be solved in linear time through back-propagation. Based on domain adaptation theory [4], we give a theoretical guarantee on the generalization error bound. Experiments show that our models exceed state-of-the-art results on five benchmark datasets.

2 Related Work

Domain adaptation [38, 39] generalizes a learner across different domains of different distributions, by matching either the marginal distributions [49, 37, 15] or the conditional distributions [55, 10].
It finds wide application in computer vision [42, 18, 16, 21] and natural language processing [9, 14]. Beyond the aforementioned shallow architectures, recent studies reveal that deep networks learn more transferable representations that disentangle the explanatory factors of variations behind data [6] and manifest invariant factors underlying different populations [14, 36]. As deep representations can only reduce, but not remove, the cross-domain distribution discrepancy [54], recent research on deep domain adaptation further embeds adaptation modules in deep networks using two main technologies for distribution matching: moment matching [29, 31, 30] and adversarial training [12, 52, 13, 51].

Pioneered by Generative Adversarial Networks (GANs) [17], adversarial learning has been successfully explored for generative modeling. GANs comprise two networks in a two-player game: a generator that captures the data distribution and a discriminator that distinguishes generated samples from real data. The networks are trained in a minimax paradigm: the generator learns to fool the discriminator, while the discriminator struggles not to be fooled. Several difficulties of GANs have been addressed, e.g. improved training [2, 1] and mode collapse [34, 7, 35], but others remain, e.g. failure to match two distributions [3]. Towards adversarial learning for domain adaptation, unconditional adversaries have been leveraged while conditional ones remain underexplored.

Sharing some spirit with the conditional GANs [3], another line of work matches the features and classes using separate domain discriminators. Hoffman et al. [23] perform global domain alignment by learning features to deceive the domain discriminator, and category-specific adaptation by minimizing a constrained multiple-instance loss.
In particular, the adversarial module for the feature representation is not conditioned on the class-adaptation module carrying class information. Chen et al. [8] perform class-wise alignment over the classifier layer; i.e., multiple domain discriminators take as input only the softmax probabilities of the source classifier, rather than being conditioned on the class information. Tsai et al. [50] impose two independent domain discriminators on the feature and class layers. These methods do not explore the dependency between the features and classes in a unified conditional domain discriminator, which is important for capturing the multimodal structures underlying data distributions.

This paper extends the conditional adversarial mechanism to enable discriminative and transferable domain adaptation, by defining the domain discriminator on the features while conditioning it on the class information. Two novel conditioning strategies are designed to capture the cross-covariance dependency between the feature representations and class predictions while controlling the uncertainty of classifier predictions. This is different from aligning the features and classes separately [23, 8, 50].

3 Conditional Adversarial Domain Adaptation

In unsupervised domain adaptation, we are given a source domain $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ of $n_s$ labeled examples and a target domain $\mathcal{D}_t = \{x_j^t\}_{j=1}^{n_t}$ of $n_t$ unlabeled examples. The source domain and target domain are sampled from joint distributions $P(x^s, y^s)$ and $Q(x^t, y^t)$ respectively, and the i.i.d. assumption is violated as $P \neq Q$.
The goal of this paper is to design a deep network $G : x \mapsto y$ that formally reduces the shifts in the data distributions across domains, such that the target risk $\epsilon_t(G) = \mathbb{E}_{(x^t, y^t) \sim Q}[G(x^t) \neq y^t]$ can be bounded by the source risk $\epsilon_s(G) = \mathbb{E}_{(x^s, y^s) \sim P}[G(x^s) \neq y^s]$ plus the distribution discrepancy $\mathrm{disc}(P, Q)$ quantified by a novel conditional domain discriminator.

Adversarial learning, the key idea enabling Generative Adversarial Networks (GANs) [17], has been successfully explored to minimize the cross-domain discrepancy [13, 51]. Denote by $f = F(x)$ the feature representation and by $g = G(x)$ the classifier prediction generated by the deep network $G$. Domain adversarial neural network (DANN) [13] is a two-player game: the first player is the domain discriminator $D$, trained to distinguish the source domain from the target domain, and the second player is the feature representation $F$, trained simultaneously to confuse the domain discriminator $D$. The error function of the domain discriminator corresponds well to the discrepancy between the feature distributions $P(f)$ and $Q(f)$ [12], a key quantity for bounding the target risk in domain adaptation theory [4].

3.1 Conditional Discriminator

We further improve existing adversarial domain adaptation methods [12, 52, 51] in two directions. First, when the joint distributions of feature and class, i.e. $P(x^s, y^s)$ and $Q(x^t, y^t)$, are non-identical across domains, adapting only the feature representation $f$ may be insufficient.
According to a quantitative study [54], deep representations eventually transition from general to specific along deep networks, with transferability decreasing remarkably in the domain-specific feature layer $f$ and classifier layer $g$. Second, when the feature distribution is multimodal, which is a real scenario due to the nature of multi-class classification, adapting only the feature representation may be challenging for adversarial networks. Recent work [17, 2, 7, 1] reveals a high risk of matching only a fraction of the components underlying different distributions with adversarial networks. Even if the discriminator is fully confused, we have no theoretical guarantee that the two distributions are identical [3].

This paper tackles these two challenges by formalizing a conditional adversarial domain adaptation framework. Recent advances in Conditional Generative Adversarial Networks (CGANs) [34] show that different distributions can be matched better by conditioning the generator and discriminator on relevant information, such as associated labels and affiliated modality. Conditional GANs [25, 35] generate globally coherent images from datasets with high variability and multimodal distributions. Motivated by conditional GANs, we observe that in adversarial domain adaptation, the classifier prediction $g$ conveys discriminative information that potentially reveals the multimodal structures, and can thus be conditioned on when adapting the feature representation $f$.
By conditioning, domain variances in both the feature representation $f$ and the classifier prediction $g$ can be modeled simultaneously. We formulate the Conditional Domain Adversarial Network (CDAN) as a minimax optimization problem with two competitive error terms: (a) $E(G)$ on the source classifier $G$, which is minimized to guarantee a lower source risk; (b) $E(D, G)$ on the source classifier $G$ and the domain discriminator $D$ across the source and target domains, which is minimized over $D$ but maximized over $f = F(x)$ and $g = G(x)$:

$$E(G) = \mathbb{E}_{(x_i^s, y_i^s) \sim \mathcal{D}_s} L(G(x_i^s), y_i^s), \quad (1)$$

$$E(D, G) = -\mathbb{E}_{x_i^s \sim \mathcal{D}_s} \log[D(f_i^s, g_i^s)] - \mathbb{E}_{x_j^t \sim \mathcal{D}_t} \log[1 - D(f_j^t, g_j^t)], \quad (2)$$

where $L(\cdot, \cdot)$ is the cross-entropy loss, and $h = (f, g)$ is the joint variable of the feature representation $f$ and classifier prediction $g$. The minimax game of the conditional domain adversarial network (CDAN) is

$$\min_G \; E(G) - \lambda E(D, G), \qquad \min_D \; E(D, G), \quad (3)$$

where $\lambda$ is a hyper-parameter that trades off the two objectives, the source risk and the domain adversary. We condition the domain discriminator $D$ on the classifier prediction $g$ through the joint variable $h = (f, g)$. This conditional domain discriminator can potentially tackle the two aforementioned challenges of adversarial domain adaptation.

Figure 1: Architectures of Conditional Domain Adversarial Networks (CDAN) for domain adaptation, where the domain-specific feature representation $f$ and classifier prediction $g$ embody the cross-domain gap to be reduced jointly by the conditional domain discriminator $D$. (a) Multilinear (M) conditioning, applicable to lower-dimensional scenarios, where $D$ is conditioned on the classifier prediction $g$ via the multilinear map $f \otimes g$; (b) randomized multilinear (RM) conditioning, suited to higher-dimensional scenarios, where $D$ is conditioned on $g$ via the randomized multilinear map $\frac{1}{\sqrt{d}}(R_f f) \odot (R_g g)$. Entropy conditioning (dashed line) leads to CDAN+E, which prioritizes $D$ on easy-to-transfer examples.

A simple conditioning of $D$ is $D(f \oplus g)$, where we concatenate the feature representation and classifier prediction into the vector $f \oplus g$ and feed it to the conditional domain discriminator $D$. This conditioning strategy is widely adopted by existing conditional GANs [34, 25, 35]. However, with the concatenation strategy, $f$ and $g$ are independent of each other, failing to fully capture the multiplicative interactions between feature representation and classifier prediction, which are crucial to domain adaptation. As a result, the multimodal information conveyed in the classifier prediction cannot be fully exploited to match the multimodal distributions of complex domains [47].

3.2 Multilinear Conditioning

The multilinear map is defined as the outer product of multiple random vectors. The multilinear map of infinite-dimensional nonlinear feature maps has been successfully applied to embed joint distributions or conditional distributions into reproducing kernel Hilbert spaces [47, 44, 45, 30]. Given two random vectors $x$ and $y$, the joint distribution $P(x, y)$ can be modeled by the cross-covariance $\mathbb{E}_{xy}[\phi(x) \otimes \phi(y)]$, where $\phi$ is a feature map induced by some reproducing kernel. Such kernel
Such kernel\nembeddings enable manipulation of the multiplicative interactions across multiple random variables.\nBesides the theoretical bene\ufb01t of the multilinear map x\u2297 y over the concatenation x\u2295 y [47, 46], we\nfurther explain its superiority intuitively. Assume linear map \u03c6(x) = x and one-hot label vector y in\nC classes. As can be veri\ufb01ed, mean map Exy[x \u2295 y] = Ex[x] \u2295 Ey[y] computes the means of x and\ny independently. In contrast, mean map Exy [x \u2297 y] = Ex [x|y = 1]\u2295 . . .\u2295Ex [x|y = C] computes\nthe means of each of the C class-conditional distributions P (x|y). Superior than concatenation, the\nmultilinear map x \u2297 y can fully capture the multimodal structures behind complex data distributions.\nTaking the advantage of multilinear map, in this paper, we condition D on g with the multilinear map\n(4)\nwhere T\u2297 is a multilinear map and D(f , g) = D(f \u2297 g). As such, the conditional domain discrimi-\nnator successfully models the multimodal information and joint distributions of f and g. Also, the\nmulti-linearity can accommodate random vectors f and g with different cardinalities and magnitudes.\nA disadvantage of the multilinear map is dimension explosion. Denoting by df and dg the dimensions\nof vectors f and g respectively, the dimension of multilinear map f \u2297 g is df \u00d7 dg, often too high-\ndimensional to be embedded into deep networks without causing parameter explosion. This paper\naddresses the dimension explosion by randomized methods [40, 26]. 
Note that the multilinear map $T_\otimes(f, g) = f \otimes g$ satisfies

$$\langle T_\otimes(f, g), T_\otimes(f', g') \rangle = \langle f \otimes g, f' \otimes g' \rangle = \langle f, f' \rangle \langle g, g' \rangle \approx \langle T_\odot(f, g), T_\odot(f', g') \rangle, \quad (5)$$

where $T_\odot(f, g)$ is an explicit randomized multilinear map of dimension $d \ll d_f \times d_g$. We define

$$T_\odot(f, g) = \frac{1}{\sqrt{d}} (R_f f) \odot (R_g g), \quad (6)$$

where $\odot$ is the element-wise product, and $R_f$ and $R_g$ are random matrices sampled only once and fixed during training, with each element $R_{ij}$ following a symmetric distribution with unit variance, i.e. $\mathbb{E}[R_{ij}] = 0$, $\mathbb{E}[R_{ij}^2] = 1$. Applicable distributions include the Gaussian distribution and the uniform distribution. As the inner product on $T_\otimes$ can be accurately approximated by the inner product on $T_\odot$, we can directly adopt $T_\odot(f, g)$ for computational efficiency. We guarantee the approximation quality by the following theorem.

Theorem 1. The expectation and variance of using $T_\odot(f, g)$ (6) to approximate $T_\otimes(f, g)$ (4) satisfy

$$\mathbb{E}[\langle T_\odot(f, g), T_\odot(f', g') \rangle] = \langle f, f' \rangle \langle g, g' \rangle,$$
$$\mathrm{var}[\langle T_\odot(f, g), T_\odot(f', g') \rangle] = \sum_{i=1}^{d} \beta(R_i^f, f)\, \beta(R_i^g, g) + C, \quad (7)$$

where $\beta(R_i^f, f) = \frac{1}{d} \sum_{j=1}^{d_f} \big[ f_j^2 f_j'^2 \cdot 2\mathbb{E}[(R_{ij}^f)^4] + C' \big]$ and similarly for $\beta(R_i^g, g)$; $C, C'$ are constants.

Proof.
The proof is given in the supplemental material.

This verifies that $T_\odot$ is an unbiased estimate of $T_\otimes$ in terms of the inner product, with estimation variance depending only on the fourth-order moments $\mathbb{E}[(R_{ij}^f)^4]$ and $\mathbb{E}[(R_{ij}^g)^4]$, which are constants for many symmetric distributions with unit variance, including the Gaussian and the uniform distribution. The bound reveals that we can further reduce the approximation error by normalizing the features. For simplicity, we define the conditioning strategy used by the conditional domain discriminator $D$ as

$$T(h) = \begin{cases} T_\otimes(f, g) & \text{if } d_f \times d_g \le 4096 \\ T_\odot(f, g) & \text{otherwise,} \end{cases} \quad (8)$$

where 4096 is the largest number of units in typical deep networks (e.g. AlexNet); if the dimension of the multilinear map $T_\otimes$ is larger than 4096, we choose the randomized multilinear map $T_\odot$.

3.3 Conditional Domain Adversarial Network

We enable conditional adversarial domain adaptation over the domain-specific feature representation $f$ and classifier prediction $g$. We jointly minimize (1) w.r.t. the source classifier $G$ and feature extractor $F$, minimize (2) w.r.t. the domain discriminator $D$, and maximize (2) w.r.t. the feature extractor $F$ and source classifier $G$.
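As a concrete illustration of the randomized multilinear map in (6) and the switching rule in (8), here is a minimal numerical sketch (our illustration, not the authors' released code); the empirical check mirrors the unbiasedness claim of Theorem 1 by averaging the randomized inner product over independent draws of the random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def T_outer(f, g):
    """Full multilinear map T(x)(f, g) = f (x) g, of dimension d_f * d_g."""
    return np.outer(f, g).ravel()

def T_randomized(f, g, R_f, R_g):
    """Randomized multilinear map of eq. (6): (R_f f) * (R_g g) / sqrt(d)."""
    d = R_f.shape[0]
    return (R_f @ f) * (R_g @ g) / np.sqrt(d)

d_f, d_g, d = 256, 31, 1024    # d_f * d_g = 7936 > 4096, so the switching
f = rng.normal(size=d_f)       # rule of eq. (8) would pick the randomized map
g = rng.normal(size=d_g)

# <T(x)(f,g), T(x)(f,g)> = <f,f><g,g> exactly, per eq. (5).
exact = np.dot(f, f) * np.dot(g, g)
assert np.isclose(np.dot(T_outer(f, g), T_outer(f, g)), exact)

# Unbiasedness: the randomized inner product, averaged over fresh draws of
# R_f and R_g (zero mean, unit variance), concentrates on the exact value.
draws = []
for _ in range(64):
    R_f = rng.normal(size=(d, d_f))
    R_g = rng.normal(size=(d, d_g))
    t = T_randomized(f, g, R_f, R_g)
    draws.append(np.dot(t, t))
rel_err = abs(np.mean(draws) - exact) / exact
print(rel_err)   # small; roughly on the order of 1e-2 for these sizes
```

In training, $R_f$ and $R_g$ would be sampled once and kept fixed, as the text specifies; the repeated sampling here is only to visualize the expectation in Theorem 1.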
This yields the minimax problem of the Conditional Domain Adversarial Network (CDAN):

$$\min_G \; \mathbb{E}_{(x_i^s, y_i^s) \sim \mathcal{D}_s} L(G(x_i^s), y_i^s) + \lambda \Big( \mathbb{E}_{x_i^s \sim \mathcal{D}_s} \log[D(T(h_i^s))] + \mathbb{E}_{x_j^t \sim \mathcal{D}_t} \log[1 - D(T(h_j^t))] \Big),$$
$$\max_D \; \mathbb{E}_{x_i^s \sim \mathcal{D}_s} \log[D(T(h_i^s))] + \mathbb{E}_{x_j^t \sim \mathcal{D}_t} \log[1 - D(T(h_j^t))], \quad (9)$$

where $\lambda$ is a hyper-parameter between the source classifier and the conditional domain discriminator, and $h = (f, g)$ is the joint variable of the domain-specific feature representation $f$ and classifier prediction $g$ for adversarial adaptation. As a rule of thumb, we can safely set $f$ to the last feature-layer representation and $g$ to the classifier-layer prediction. In cases where lower-layer features are not transferable, as in pixel-level adaptation tasks [25, 22], we can change $f$ to lower-layer representations.

Entropy Conditioning. The minimax problem for the conditional domain discriminator (9) imposes equal importance on different examples, while hard-to-transfer examples with uncertain predictions may deteriorate the conditional adversarial adaptation procedure. Towards safe transfer, we quantify the uncertainty of classifier predictions by the entropy criterion $H(g) = -\sum_{c=1}^{C} g_c \log g_c$, where $C$ is the number of classes and $g_c$ is the probability of predicting an example into class $c$.
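The entropy criterion, together with the entropy-aware weight $w(H(g)) = 1 + e^{-H(g)}$ used in the entropy-conditioning variant defined next, can be sketched in a few lines (our illustration; the inputs are assumed to be softmax probability vectors):

```python
import math

def entropy(g):
    """H(g) = -sum_c g_c log g_c for a probability vector g of predictions."""
    return -sum(p * math.log(p) for p in g if p > 0.0)

def weight(g):
    """Entropy-aware weight w(H(g)) = 1 + exp(-H(g)): certain predictions
    (low entropy) get weights near 2, uncertain ones decay toward 1."""
    return 1.0 + math.exp(-entropy(g))

confident = [0.97, 0.01, 0.01, 0.01]   # near one-hot: easy to transfer
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform over C = 4 classes

print(entropy(uncertain))                     # log(4) ~ 1.386, the maximum
print(weight(confident) > weight(uncertain))  # prints True
```

The weight thus up-weights easy-to-transfer examples rather than masking uncertain ones outright, which keeps all examples in the discriminator's objective.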
We prioritize the discriminator on easy-to-transfer examples with certain predictions by reweighting each training example of the conditional domain discriminator with an entropy-aware weight $w(H(g)) = 1 + e^{-H(g)}$. The entropy-conditioning variant of CDAN (CDAN+E) for improved transferability is formulated as

$$\min_G \; \mathbb{E}_{(x_i^s, y_i^s) \sim \mathcal{D}_s} L(G(x_i^s), y_i^s) + \lambda \Big( \mathbb{E}_{x_i^s \sim \mathcal{D}_s} w(H(g_i^s)) \log[D(T(h_i^s))] + \mathbb{E}_{x_j^t \sim \mathcal{D}_t} w(H(g_j^t)) \log[1 - D(T(h_j^t))] \Big),$$
$$\max_D \; \mathbb{E}_{x_i^s \sim \mathcal{D}_s} w(H(g_i^s)) \log[D(T(h_i^s))] + \mathbb{E}_{x_j^t \sim \mathcal{D}_t} w(H(g_j^t)) \log[1 - D(T(h_j^t))]. \quad (10)$$

The domain discriminator thus embodies the entropy minimization principle [19] and encourages certain predictions, enabling CDAN+E to further perform semi-supervised learning on unlabeled target data.

3.4 Generalization Error Analysis

We analyze the CDAN method using a formalism similar to domain adaptation theory [5, 4]. We first consider the source and target domains over the fixed representation space $f = F(x)$, and a family of source classifiers $G$ in a hypothesis space $\mathcal{H}$ [13]. Denote by $\epsilon_P(G) = \mathbb{E}_{(f, y) \sim P}[G(f) \neq y]$ the risk of a hypothesis $G \in \mathcal{H}$ w.r.t. distribution $P$, and by $\epsilon_P(G_1, G_2) = \mathbb{E}_{(f, y) \sim P}[G_1(f) \neq G_2(f)]$ the disagreement between hypotheses $G_1, G_2 \in \mathcal{H}$. Let $G^* = \arg\min_G \epsilon_P(G) + \epsilon_Q(G)$ be the ideal hypothesis that explicitly embodies the notion of adaptability.
The probabilistic bound [4] on the target risk $\epsilon_Q(G)$ of a hypothesis $G$ is given by the source risk $\epsilon_P(G)$ plus the distribution discrepancy:

$$\epsilon_Q(G) \le \epsilon_P(G) + [\epsilon_P(G^*) + \epsilon_Q(G^*)] + |\epsilon_P(G, G^*) - \epsilon_Q(G, G^*)|. \quad (11)$$

The goal of domain adaptation is to reduce the distribution discrepancy $|\epsilon_P(G, G^*) - \epsilon_Q(G, G^*)|$. By definition, $\epsilon_P(G, G^*) = \mathbb{E}_{(f, y) \sim P}[G(f) \neq G^*(f)] = \mathbb{E}_{(f, g) \sim P_G}[g \neq G^*(f)] = \epsilon_{P_G}(G^*)$, and similarly $\epsilon_Q(G, G^*) = \epsilon_{Q_G}(G^*)$. Note that $P_G = (f, G(f))_{f \sim P(f)}$ and $Q_G = (f, G(f))_{f \sim Q(f)}$ are proxies of the joint distributions $P(f, y)$ and $Q(f, y)$, respectively [10]. Based on these proxies, $|\epsilon_P(G, G^*) - \epsilon_Q(G, G^*)| = |\epsilon_{P_G}(G^*) - \epsilon_{Q_G}(G^*)|$. Define a (loss) difference hypothesis space $\Delta \triangleq \{\delta = |g - G^*(f)| : G^* \in \mathcal{H}\}$ over the joint variable $(f, g)$, where $\delta : (f, g) \mapsto \{0, 1\}$ outputs the loss of $G^* \in \mathcal{H}$.
Based on the above difference hypothesis space $\Delta$, we define the $\Delta$-distance as

$$d_\Delta(P_G, Q_G) \triangleq \sup_{\delta \in \Delta} \big| \mathbb{E}_{(f,g) \sim P_G}[\delta(f, g) \neq 0] - \mathbb{E}_{(f,g) \sim Q_G}[\delta(f, g) \neq 0] \big|$$
$$= \sup_{G^* \in \mathcal{H}} \big| \mathbb{E}_{(f,g) \sim P_G}[\,|g - G^*(f)| \neq 0] - \mathbb{E}_{(f,g) \sim Q_G}[\,|g - G^*(f)| \neq 0] \big|$$
$$\ge \big| \mathbb{E}_{(f,g) \sim P_G}[g \neq G^*(f)] - \mathbb{E}_{(f,g) \sim Q_G}[g \neq G^*(f)] \big| = |\epsilon_{P_G}(G^*) - \epsilon_{Q_G}(G^*)|. \quad (12)$$

Hence, the domain discrepancy $|\epsilon_P(G, G^*) - \epsilon_Q(G, G^*)|$ can be upper-bounded by the $\Delta$-distance. Since the difference hypothesis space $\Delta$ is a continuous function class, we assume that the family of domain discriminators $\mathcal{H}_D$ is rich enough to contain $\Delta$, i.e. $\Delta \subset \mathcal{H}_D$. This assumption is not unrealistic, as we are free to choose $\mathcal{H}_D$, for example multilayer perceptrons, which can fit arbitrary functions. Given these assumptions, we show that training the domain discriminator $D$ is related to $d_\Delta(P_G, Q_G)$:

$$d_\Delta(P_G, Q_G) \le \sup_{D \in \mathcal{H}_D} \big| \mathbb{E}_{(f,g) \sim P_G}[D(f, g) \neq 0] - \mathbb{E}_{(f,g) \sim Q_G}[D(f, g) \neq 0] \big|$$
$$\le \sup_{D \in \mathcal{H}_D} \big| \mathbb{E}_{(f,g) \sim P_G}[D(f, g) = 1] + \mathbb{E}_{(f,g) \sim Q_G}[D(f, g) = 0] \big|. \quad (13)$$

This supremum is approached in the process of training the optimal discriminator $D$ in CDAN, giving an upper bound of $d_\Delta(P_G, Q_G)$.
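The discriminator-based quantity in (13) can be illustrated on synthetic data. Below is a toy sketch of ours (not part of the paper): a logistic discriminator, trained by plain gradient descent, distinguishes two Gaussian samples standing in for $P_G$ and $Q_G$; the farther apart the domains, the larger the estimated bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def discrepancy(shift, n=2000, dim=4, steps=500, lr=0.5):
    """Estimate E_P[D = 1] + E_Q[D = 0] from eq. (13) with a logistic D.
    P and Q are synthetic Gaussian stand-ins for the proxies P_G and Q_G."""
    P = rng.normal(0.0, 1.0, size=(n, dim))
    Q = rng.normal(shift, 1.0, size=(n, dim))
    X = np.vstack([P, Q])
    y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = drawn from P
    w, b = np.zeros(dim), 0.0
    for _ in range(steps):                         # logistic regression by GD
        z = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = z - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    D = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
    return D[:n].mean() + (1.0 - D[n:]).mean()

# Identical domains: D cannot separate them, so the estimate stays near 1.
# Shifted domains: D separates them almost perfectly, pushing it toward 2.
print(discrepancy(0.0) < discrepancy(2.0))  # prints True
```

This only visualizes the bound's behavior; in CDAN the discriminator acts on the conditioned joint variable $T(h)$ and is trained jointly with the feature extractor.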
Simultaneously, we learn the representation $f$ to minimize $d_\Delta(P_G, Q_G)$, yielding a better approximation of $\epsilon_Q(G)$ by $\epsilon_P(G)$ to bound the target risk in the minimax paradigm.

4 Experiments

We evaluate the proposed conditional domain adversarial networks against many state-of-the-art transfer learning and deep learning methods. Code will be available at http://github.com/thuml/CDAN.

4.1 Setup

Office-31 [42] is the most widely used dataset for visual domain adaptation, with 4,652 images and 31 categories collected from three distinct domains: Amazon (A), Webcam (W), and DSLR (D). We evaluate all methods on six transfer tasks: A → W, D → W, W → D, A → D, D → A, and W → A.

ImageCLEF-DA¹ is a dataset organized by selecting the 12 common classes shared by three public datasets (domains): Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). We permute all three domains and build six transfer tasks: I → P, P → I, I → C, C → I, C → P, and P → C.

Office-Home [53] is a better organized but more difficult dataset than Office-31, consisting of 15,500 images in 65 object classes in office and home settings, forming four extremely dissimilar domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Rw).

Digits. We investigate three digit datasets: MNIST, USPS, and Street View House Numbers (SVHN). We adopt the evaluation protocol of CyCADA [22] with three transfer tasks: USPS to MNIST (U → M), MNIST to USPS (M → U), and SVHN to MNIST (S → M). We train our model using the training sets: MNIST (60,000), USPS (7,291), and the standard SVHN training set (73,257).
Evaluation is reported on the standard test sets: MNIST (10,000) and USPS (2,007); the numbers of images are given in parentheses.

¹ http://imageclef.org/2014/adaptation

VisDA-2017² is a challenging simulation-to-real dataset with two very distinct domains: Synthetic, renderings of 3D models from different angles and under different lighting conditions, and Real, natural images. It contains over 280K images across 12 classes in the training, validation, and test domains.

We compare the Conditional Domain Adversarial Network (CDAN) with state-of-the-art domain adaptation methods: Deep Adaptation Network (DAN) [29], Residual Transfer Network (RTN) [31], Domain Adversarial Neural Network (DANN) [13], Adversarial Discriminative Domain Adaptation (ADDA) [51], Joint Adaptation Network (JAN) [30], Unsupervised Image-to-Image Translation (UNIT) [28], Generate to Adapt (GTA) [43], and Cycle-Consistent Adversarial Domain Adaptation (CyCADA) [22].

We follow the standard protocols for unsupervised domain adaptation [12, 30]. We use all labeled source examples and all unlabeled target examples, and compare the average classification accuracy over three random experiments. We conduct importance-weighted cross-validation (IWCV) [48] to select hyper-parameters for all methods. As CDAN performs stably under different parameters, we fix λ = 1 for all experiments. For MMD-based methods (TCA, DAN, RTN, and JAN), we use a Gaussian kernel with bandwidth set to the median pairwise distance on the training data [29]. We adopt AlexNet [27] and ResNet-50 [20] as base networks, and all methods differ only in their discriminators. We implement AlexNet-based methods in Caffe and ResNet-based methods in PyTorch. We fine-tune from ImageNet pre-trained models [41], except on the digit datasets, where we train our models from scratch.
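The optimization schedules detailed next, the annealed learning rate and the progressive ramp-up of λ, can be summarized in code. This is our sketch of the stated formulas under the reported hyper-parameters (η₀ = 0.01, α = 10, β = 0.75, δ = 10), not the released training code:

```python
import math

def lr_schedule(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p = eta0 * (1 + alpha * p)^(-beta),
    where p is the training progress in [0, 1] (schedule of DANN [13])."""
    return eta0 * (1.0 + alpha * p) ** (-beta)

def lambda_schedule(p, delta=10.0):
    """Progressive discriminator weight, ramping from 0 toward 1:
    lambda_p = (1 - exp(-delta * p)) / (1 + exp(-delta * p))."""
    return (1.0 - math.exp(-delta * p)) / (1.0 + math.exp(-delta * p))

print(lr_schedule(0.0))   # prints 0.01, the initial learning rate
print(round(lambda_schedule(0.0), 3), round(lambda_schedule(1.0), 3))
# the adversarial term starts off at 0 and is almost fully on (~1) at p = 1
```

Keeping λ near zero early in training lets the classifier stabilize before the noisy discriminator signal is allowed to shape the features.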
We train the new layers and the classifier layer through back-propagation, where the classifier is trained from scratch with a learning rate 10 times that of the lower layers. We adopt mini-batch SGD with momentum 0.9 and the learning rate annealing strategy of [13]: the learning rate is adjusted as η_p = η_0 (1 + αp)^(−β), where p is the training progress changing from 0 to 1, and η_0 = 0.01, α = 10, β = 0.75 are optimized by importance-weighted cross-validation [48]. We adopt a progressive training strategy for the discriminator, increasing λ from 0 to 1 by multiplying it by (1 − exp(−δp)) / (1 + exp(−δp)), with δ = 10.

Table 1: Accuracy (%) on Office-31 for unsupervised domain adaptation (AlexNet and ResNet)

| Method         | A → W    | D → W    | W → D     | A → D    | D → A    | W → A    | Avg  |
|----------------|----------|----------|-----------|----------|----------|----------|------|
| AlexNet [27]   | 61.6±0.5 | 95.4±0.3 | 99.0±0.2  | 63.8±0.5 | 51.1±0.6 | 49.8±0.4 | 70.1 |
| DAN [29]       | 68.5±0.5 | 96.0±0.3 | 99.0±0.3  | 67.0±0.4 | 54.0±0.5 | 53.1±0.5 | 72.9 |
| RTN [31]       | 73.3±0.3 | 96.8±0.2 | 99.6±0.1  | 71.0±0.2 | 50.5±0.3 | 51.0±0.1 | 73.7 |
| DANN [13]      | 73.0±0.5 | 96.4±0.3 | 99.2±0.3  | 72.3±0.3 | 53.4±0.4 | 51.2±0.5 | 74.3 |
| ADDA [51]      | 73.5±0.6 | 96.2±0.4 | 98.8±0.4  | 71.6±0.4 | 54.6±0.5 | 53.5±0.6 | 74.7 |
| JAN [30]       | 74.9±0.3 | 96.6±0.2 | 99.5±0.2  | 71.8±0.2 | 58.3±0.3 | 55.0±0.4 | 76.0 |
| CDAN           | 77.9±0.3 | 96.9±0.2 | 100.0±0.0 | 75.1±0.2 | 54.5±0.3 | 57.5±0.4 | 77.0 |
| CDAN+E         | 78.3±0.2 | 97.2±0.1 | 100.0±0.0 | 76.3±0.1 | 57.3±0.2 | 57.3±0.3 | 77.7 |
| ResNet-50 [20] | 68.4±0.2 | 96.7±0.1 | 99.3±0.1  | 68.9±0.2 | 62.5±0.3 | 60.7±0.3 | 76.1 |
| DAN [29]       | 80.5±0.4 | 97.1±0.2 | 99.6±0.1  | 78.6±0.2 | 63.6±0.3 | 62.8±0.2 | 80.4 |
| RTN [31]       | 84.5±0.2 | 96.8±0.1 | 99.4±0.1  | 77.5±0.3 | 66.2±0.2 | 64.8±0.3 | 81.6 |
| DANN [13]      | 82.0±0.4 | 96.9±0.2 | 99.1±0.1  | 79.7±0.4 | 68.2±0.4 | 67.4±0.5 | 82.2 |
| ADDA [51]      | 86.2±0.5 | 96.2±0.3 | 98.4±0.3  | 77.8±0.3 | 69.5±0.4 | 68.9±0.5 | 82.9 |
| JAN [30]       | 85.4±0.3 | 97.4±0.2 | 99.8±0.2  | 84.7±0.3 | 68.6±0.3 | 70.0±0.4 | 84.3 |
| GTA [43]       | 89.5±0.5 | 97.9±0.3 | 99.8±0.4  | 87.7±0.5 | 72.8±0.3 | 71.4±0.4 | 86.5 |
| CDAN           | 93.1±0.2 | 98.2±0.2 | 100.0±0.0 | 89.8±0.3 | 70.1±0.4 | 68.0±0.4 | 86.6 |
| CDAN+E         | 94.1±0.1 | 98.6±0.1 | 100.0±0.0 | 92.9±0.2 | 71.0±0.3 | 69.3±0.3 | 87.7 |

4.2 Results
The results on Office-31 based on AlexNet and ResNet are reported in Table 1, with the results of the baselines taken directly from their original papers where the protocol is the same. The CDAN models significantly outperform all comparison methods on most transfer tasks, with CDAN+E the top-performing variant and CDAN only slightly behind. It is desirable that CDAN promotes the classification accuracies substantially on hard transfer tasks, e.g. A → W and A → D, where the source and target domains are substantially different [42]. Note that CDAN+E even outperforms GTA, a generative pixel-level domain adaptation method with a very complex design in both architecture and objectives.
The results on the ImageCLEF-DA dataset are reported in Table 2. The CDAN models outperform the comparison methods on most transfer tasks, but with smaller margins of improvement.
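The learning-rate annealing and the progressive discriminator weight described in the training setup above can be sketched in a few lines of plain Python; the defaults mirror the reported η_0 = 0.01, α = 10, β = 0.75, and δ = 10:

```python
import math

def learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Annealed learning rate eta_p = eta0 * (1 + alpha * p) ** (-beta),
    where p in [0, 1] is the training progress."""
    return eta0 * (1.0 + alpha * p) ** (-beta)

def discriminator_weight(p, delta=10.0):
    """Progressive weight for the adversarial term: lambda is ramped
    from 0 to 1 via (1 - exp(-delta * p)) / (1 + exp(-delta * p))."""
    return (1.0 - math.exp(-delta * p)) / (1.0 + math.exp(-delta * p))
```

At the start of training the adversarial term is effectively switched off (λ ≈ 0), and it approaches 1 as training progresses, which keeps the noisy early discriminator from destabilizing the feature learner.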
This is reasonable since the three domains in ImageCLEF-DA are of equal size and balanced in each category, and are visually more similar than those of Office-31, making ImageCLEF-DA the easier dataset for domain adaptation.

Table 2: Accuracy (%) on ImageCLEF-DA for unsupervised domain adaptation (AlexNet and ResNet)

| Method         | I → P    | P → I    | I → C    | C → I    | C → P    | P → C    | Avg  |
|----------------|----------|----------|----------|----------|----------|----------|------|
| AlexNet [27]   | 66.2±0.2 | 70.0±0.2 | 84.3±0.2 | 71.3±0.4 | 59.3±0.5 | 84.5±0.3 | 73.9 |
| DAN [29]       | 67.3±0.2 | 80.5±0.3 | 87.7±0.3 | 76.0±0.3 | 61.6±0.3 | 88.4±0.2 | 76.9 |
| DANN [13]      | 66.5±0.6 | 81.8±0.3 | 89.0±0.4 | 79.8±0.6 | 63.5±0.5 | 88.7±0.3 | 78.2 |
| JAN [30]       | 67.2±0.5 | 82.8±0.4 | 91.3±0.5 | 80.0±0.5 | 63.5±0.4 | 91.0±0.4 | 79.3 |
| CDAN           | 67.7±0.3 | 83.3±0.1 | 91.8±0.2 | 81.5±0.2 | 63.0±0.2 | 91.5±0.3 | 79.8 |
| CDAN+E         | 67.0±0.4 | 84.8±0.2 | 92.4±0.3 | 81.3±0.3 | 64.7±0.3 | 91.6±0.4 | 80.3 |
| ResNet-50 [20] | 74.8±0.3 | 83.9±0.1 | 91.5±0.3 | 78.0±0.2 | 65.5±0.3 | 91.2±0.3 | 80.7 |
| DAN [29]       | 74.5±0.4 | 82.2±0.2 | 92.8±0.2 | 86.3±0.4 | 69.2±0.4 | 89.8±0.4 | 82.5 |
| DANN [13]      | 75.0±0.6 | 86.0±0.3 | 96.2±0.4 | 87.0±0.5 | 74.3±0.5 | 91.5±0.6 | 85.0 |
| JAN [30]       | 76.8±0.4 | 88.0±0.2 | 94.7±0.2 | 89.5±0.3 | 74.2±0.3 | 91.7±0.3 | 85.8 |
| CDAN           | 76.7±0.3 | 90.6±0.3 | 97.0±0.4 | 90.5±0.4 | 74.5±0.3 | 93.5±0.4 | 87.1 |
| CDAN+E         | 77.7±0.3 | 90.7±0.2 | 97.7±0.3 | 91.3±0.3 | 74.2±0.2 | 94.3±0.3 | 87.7 |

Table 3: Accuracy (%) on Office-Home for unsupervised domain adaptation (AlexNet and ResNet)

| Method         | Ar→Cl | Ar→Pr | Ar→Rw | Cl→Ar | Cl→Pr | Cl→Rw | Pr→Ar | Pr→Cl | Pr→Rw | Rw→Ar | Rw→Cl | Rw→Pr | Avg  |
|----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------|
| AlexNet [27]   | 26.4  | 32.6  | 41.3  | 22.1  | 41.7  | 42.1  | 20.5  | 20.3  | 51.1  | 31.0  | 27.9  | 54.9  | 34.3 |
| DAN [29]       | 31.7  | 43.2  | 55.1  | 33.8  | 48.6  | 50.8  | 30.1  | 35.1  | 57.7  | 44.6  | 39.3  | 63.7  | 44.5 |
| DANN [13]      | 36.4  | 45.2  | 54.7  | 35.2  | 51.8  | 55.1  | 31.6  | 39.7  | 59.3  | 45.7  | 46.4  | 65.9  | 47.3 |
| JAN [30]       | 35.5  | 46.1  | 57.7  | 36.4  | 53.3  | 54.5  | 33.4  | 40.3  | 60.1  | 45.9  | 47.4  | 67.9  | 48.2 |
| CDAN           | 36.2  | 47.3  | 58.6  | 37.3  | 54.4  | 58.3  | 33.2  | 43.9  | 62.1  | 48.2  | 48.1  | 70.7  | 49.9 |
| CDAN+E         | 38.1  | 50.3  | 60.3  | 39.7  | 56.4  | 57.8  | 35.5  | 43.1  | 63.2  | 48.4  | 48.5  | 71.1  | 51.0 |
| ResNet-50 [20] | 34.9  | 50.0  | 58.0  | 37.4  | 41.9  | 46.2  | 38.5  | 31.2  | 60.4  | 53.9  | 41.2  | 59.9  | 46.1 |
| DAN [29]       | 43.6  | 57.0  | 67.9  | 45.8  | 56.5  | 60.4  | 44.0  | 43.6  | 67.7  | 63.1  | 51.5  | 74.3  | 56.3 |
| DANN [13]      | 45.6  | 59.3  | 70.1  | 47.0  | 58.5  | 60.9  | 46.1  | 43.7  | 68.5  | 63.2  | 51.8  | 76.8  | 57.6 |
| JAN [30]       | 45.9  | 61.2  | 68.9  | 50.4  | 59.7  | 61.0  | 45.8  | 43.4  | 70.3  | 63.9  | 52.4  | 76.8  | 58.3 |
| CDAN           | 49.0  | 69.3  | 74.5  | 54.4  | 66.0  | 68.4  | 55.6  | 48.3  | 75.9  | 68.4  | 55.4  | 80.5  | 63.8 |
| CDAN+E         | 50.7  | 70.6  | 76.0  | 57.6  | 70.0  | 70.0  | 57.4  | 50.9  | 77.3  | 70.9  | 56.7  | 81.6  | 65.8 |

Table 4: Accuracy (%) on Digits and VisDA-2017 for unsupervised domain adaptation (ResNet-50)

| Method      | M → U | U → M | S → M | Avg  |
|-------------|-------|-------|-------|------|
| UNIT [28]   | 96.0  | 93.6  | 90.5  | 93.4 |
| CyCADA [22] | 95.6  | 96.5  | 90.4  | 94.2 |
| CDAN        | 93.9  | 96.9  | 88.5  | 93.1 |
| CDAN+E      | 95.6  | 98.0  | 89.2  | 94.3 |

| Method   | Synthetic → Real |
|----------|------------------|
| JAN [30] | 61.6             |
| GTA [43] | 69.5             |
| CDAN     | 66.8             |
| CDAN+E   | 70.0             |

The results on Office-Home are reported in Table 3. The CDAN models substantially outperform the comparison methods on most transfer tasks, and with larger margins of improvement.
An interpretation is that the four domains in Office-Home have more categories, are visually more dissimilar from each other, and are harder even within each domain, with much lower in-domain classification accuracy [53]. Since domain alignment in previous work is category-agnostic, it is possible that the aligned domains are not classification-friendly in the presence of a large number of categories. It is desirable that the CDAN models yield larger boosts on such difficult domain adaptation tasks, which highlights the power of adversarial domain adaptation that exploits the complex multimodal structures in classifier predictions.
Strong results are also achieved on the digits and synthetic-to-real datasets, as reported in Table 4. Note that the generative pixel-level adaptation methods UNIT, CyCADA, and GTA are specifically tailored to the digits and synthetic-to-real adaptation tasks, which explains why the previous feature-level adaptation method JAN performs fairly weakly on them. To our knowledge, CDAN+E is the only approach that works reasonably well on all five datasets, while remaining a simple discriminative model.
4.3 Analysis
Ablation Study We examine the sampling strategies for the random matrices in Equation (6). We evaluate CDAN+E (w/ gaussian sampling) and CDAN+E (w/ uniform sampling), whose random matrices are sampled only once from Gaussian and uniform distributions, respectively. Table 5 shows that CDAN+E (w/o random sampling) performs best overall, while CDAN+E (w/ uniform sampling) is the better of the two randomized variants.
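For concreteness, the two sampling strategies compared in this ablation can be sketched as follows, assuming the randomized multilinear map takes the form T(f, g) = (R_f f) ⊙ (R_g g) / √d, with the random matrices drawn once and then frozen; the function names and dimensions here are illustrative, not taken from the released code:

```python
import numpy as np

def multilinear_map(f, g):
    """Full multilinear conditioning: the flattened outer product
    f ⊗ g, whose dimension is |f| * |g|."""
    return np.outer(f, g).ravel()

def random_multilinear_map(f, g, d=1024, sampling="gaussian", seed=0):
    """Randomized multilinear conditioning: project f and g with fixed
    random matrices and take the Hadamard product, approximating the
    inner products induced by the full outer product in dimension d."""
    rng = np.random.RandomState(seed)  # matrices are sampled only once
    if sampling == "gaussian":
        Rf = rng.randn(d, f.shape[0])
        Rg = rng.randn(d, g.shape[0])
    else:  # the uniform sampling variant from the ablation
        Rf = rng.uniform(-1.0, 1.0, size=(d, f.shape[0]))
        Rg = rng.uniform(-1.0, 1.0, size=(d, g.shape[0]))
    return (Rf @ f) * (Rg @ g) / np.sqrt(d)
```

The randomized map keeps the conditioning input at a fixed dimension d regardless of the feature and class dimensions, which is why it trades a small amount of accuracy for a much lower cost than the full outer product.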
Tables 1–4 show that CDAN+E outperforms CDAN, indicating that entropy conditioning can prioritize easy-to-transfer examples and encourage certain predictions.

Table 5: Accuracy (%) of CDAN variants on Office-31 for unsupervised domain adaptation (ResNet)

| Method                         | A → W    | D → W    | W → D     | A → D    | D → A    | W → A    | Avg  |
|--------------------------------|----------|----------|-----------|----------|----------|----------|------|
| CDAN+E (w/ gaussian sampling)  | 93.0±0.2 | 98.4±0.2 | 100.0±0.0 | 89.2±0.3 | 70.2±0.4 | 67.4±0.4 | 86.4 |
| CDAN+E (w/ uniform sampling)   | 94.0±0.2 | 98.4±0.2 | 100.0±0.0 | 89.8±0.3 | 70.1±0.4 | 69.4±0.4 | 87.0 |
| CDAN+E (w/o random sampling)   | 94.1±0.1 | 98.6±0.1 | 100.0±0.0 | 92.9±0.2 | 71.0±0.3 | 69.3±0.3 | 87.7 |

Conditioning Strategies Besides multilinear conditioning, we investigate DANN-[f,g], with the domain discriminator imposed on the concatenation of f and g, and DANN-f and DANN-g, with the domain discriminator plugged into the feature layer f and the classifier layer g, respectively. Figure 2(a) shows accuracies on A → W and A → D based on ResNet-50. The concatenation strategy is not successful, as it cannot capture the cross-covariance between features and classes, which is crucial to domain adaptation [10]. Figure 2(b) shows that the entropy weight e^(−H(g)) corresponds well with prediction correctness: the entropy weight is ≈ 1 when the prediction is correct, and much smaller than 1 when the prediction is incorrect (uncertain). This reveals the power of entropy conditioning to guarantee example transferability.

Figure 2: Analysis of conditioning strategies, distribution discrepancy, and convergence: (a) multilinear, (b) entropy, (c) discrepancy, (d) convergence.

Distribution Discrepancy The A-distance is a measure of distribution discrepancy [4, 33], defined as dist_A = 2(1 − 2ε), where ε is the test error of a classifier trained to discriminate the source from the target.
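Both of the analysis quantities used in this section, the entropy weight e^(−H(g)) from Figure 2(b) and the proxy A-distance defined above, have one-line forms; the following is a minimal numpy sketch (the names are ours):

```python
import numpy as np

def entropy_weight(g, eps=1e-12):
    """Entropy conditioning weight w = exp(-H(g)) for a softmax
    prediction g: near 1 for confident predictions and as small as
    1/C for a uniform prediction over C classes."""
    H = -np.sum(g * np.log(g + eps))  # Shannon entropy of g
    return np.exp(-H)

def a_distance(domain_clf_error):
    """Proxy A-distance dist_A = 2 * (1 - 2 * eps), where eps is the
    test error of a classifier separating source from target."""
    return 2.0 * (1.0 - 2.0 * domain_clf_error)
```

A chance-level domain classifier (error 0.5) gives dist_A = 0, i.e. indistinguishable domains, while a perfect one (error 0) gives the maximal dist_A = 2.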
Figure 2(c) shows dist_A on the tasks A → W and W → D with the features of ResNet, DANN, and CDAN. We observe that dist_A on CDAN features is smaller than dist_A on both ResNet and DANN features, implying that CDAN features reduce the domain gap more effectively. As domains W and D are similar, dist_A of task W → D is smaller than that of A → W, implying higher accuracies.
Convergence We examine the convergence of ResNet, DANN, and the CDANs, with the test errors on task A → W shown in Figure 2(d). CDAN enjoys faster convergence than DANN, and CDAN (M) converges faster than CDAN (RM). Note that CDAN (M) computes the high-dimensional multilinear map, which is slightly more costly than CDAN (RM), while CDAN (RM) has a cost similar to DANN.

Figure 3: t-SNE of (a) ResNet, (b) DANN, (c) CDAN-f, (d) CDAN-fg (red: A; blue: W).

Visualization We visualize by t-SNE [32] in Figures 3(a)–3(d) the representations of task A → W (31 classes) learned by ResNet, DANN, CDAN-f, and CDAN-fg. With ResNet, the source and target are not aligned well; with DANN they are aligned better, but the categories are not discriminated well. CDAN-f aligns the domains and discriminates the categories better still, and CDAN-fg is evidently better than CDAN-f. This shows the benefit of conditioning adversarial adaptation on discriminative predictions.

5 Conclusion
This paper presented conditional domain adversarial networks (CDANs), novel approaches to domain adaptation with multimodal distributions. Unlike previous adversarial adaptation methods that solely match the feature representations across domains, which is prone to under-matching, the proposed approach further conditions adversarial domain adaptation on discriminative information to enable alignment of multimodal distributions.
Experiments validated the efficacy of the proposed approach.

Acknowledgments
We thank Yuchen Zhang at Tsinghua University for insightful discussions. This work was supported by the National Key R&D Program of China (2016YFB1000701), the Natural Science Foundation of China (61772299, 71690231, 61502265), and the DARPA Program on Lifelong Learning Machines.

References
[1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations (ICLR), 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In International Conference on Machine Learning (ICML), 2017.
[3] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In International Conference on Machine Learning (ICML), 2017.
[4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[5] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NIPS), 2007.
[6] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(8):1798–1828, 2013.
[7] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li.
Mode regularized generative adversarial networks.\n\nInternational Conference on Learning Representations (ICLR), 2017.\n\n[8] Y. Chen, W. Chen, Y. Chen, B. Tsai, Y. F. Wang, and M. Sun. No more discrimination: Cross city\nadaptation of road scene segmenters. In The IEEE International Conference on Computer Vision (ICCV),\npages 2011\u20132020, 2017.\n\n[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing\n\n(almost) from scratch. Journal of Machine Learning Research (JMLR), 12:2493\u20132537, 2011.\n\n[10] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for\ndomain adaptation. In Advances in Neural Information Processing Systems (NIPS), pages 3730\u20133739.\n2017.\n\n[11] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional\nactivation feature for generic visual recognition. In International Conference on Machine Learning (ICML),\n2014.\n\n[12] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International\n\nConference on Machine Learning (ICML), 2015.\n\n[13] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky.\nDomain-adversarial training of neural networks. The Journal of Machine Learning Research (JMLR),\n17(1):2096\u20132030, 2016.\n\n[14] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classi\ufb01cation: A deep\n\nlearning approach. In International Conference on Machine Learning (ICML), 2011.\n\n[15] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-\ninvariant features for unsupervised domain adaptation. In International Conference on Machine Learning\n(ICML), 2013.\n\n[16] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic \ufb02ow kernel for unsupervised domain adaptation. 
In\n\nIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.\n\n[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\n\nGenerative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[18] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach.\n\nIn IEEE International Conference on Computer Vision (ICCV), 2011.\n\n[19] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural\n\nInformation Processing Systems (NIPS), 2005.\n\n[20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference\n\non Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[21] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA:\nLarge scale detection through adaptation. In Advances in Neural Information Processing Systems (NIPS),\n2014.\n\n[22] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-\nconsistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine\nLearning, pages 1989\u20131998, 2018.\n\n[23] J. Hoffman, D. Wang, F. Yu, and T. Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based\n\nadaptation. CoRR, abs/1612.02649, 2016.\n\n[24] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Sch\u00f6lkopf. Correcting sample selection bias\n\nby unlabeled data. In Advances in Neural Information Processing Systems (NIPS), 2006.\n\n[25] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial\n\nnetworks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[26] P. Kar and H. Karnick. Random feature maps for dot product kernels. 
In International Conference on Artificial Intelligence and Statistics (AISTATS), volume 22, pages 583–591, 2012.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
[28] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NIPS), pages 700–708, 2017.
[29] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), 2015.
[30] M. Long, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning (ICML), 2017.
[31] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In Advances in Neural Information Processing Systems (NIPS), pages 136–144, 2016.
[32] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(Nov):2579–2605, 2008.
[33] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Computational Learning Theory (COLT), 2009.
[34] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[35] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (ICML), 2017.
[36] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[37] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis.
IEEE Transactions on Neural Networks (TNN), 22(2):199–210, 2011.
[38] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(10):1345–1359, 2010.
[39] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
[40] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2008.
[41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. 2014.
[42] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision (ECCV), 2010.
[43] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[44] L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. Smola. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning (ICML), 2010.
[45] L. Song and B. Dai. Robust low rank kernel embeddings of multivariate distributions. In Advances in Neural Information Processing Systems (NIPS), pages 3228–3236, 2013.
[46] L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30(4):98–111, 2013.
[47] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning (ICML), 2009.
[48] M. Sugiyama, M.
Krauledat, and K.-R. Muller. Covariate shift adaptation by importance weighted cross\n\nvalidation. Journal of Machine Learning Research (JMLR), 8(May):985\u20131005, 2007.\n\n[49] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe. Direct importance estimation\nwith model selection and its application to covariate shift adaptation. In Advances in Neural Information\nProcessing Systems (NIPS), 2008.\n\n[50] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker. Learning to adapt structured\noutput space for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition\n(CVPR), 2018.\n\n[51] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In IEEE\n\nConference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[52] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Simultaneous deep transfer across domains\n\nand tasks. In IEEE International Conference on Computer Vision (ICCV), 2015.\n\n[53] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised\n\ndomain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[54] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In\n\nAdvances in Neural Information Processing Systems (NIPS), 2014.\n\n[55] K. Zhang, B. Sch\u00f6lkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift.\n\nIn International Conference on Machine Learning (ICML), 2013.\n\n[56] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent\n\nadversarial networks. 
In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.