{"title": "Transferable Normalization: Towards Improving Transferability of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1953, "page_last": 1963, "abstract": "Deep neural networks (DNNs) excel at learning representations when trained on large-scale datasets. Pre-trained DNNs also show strong transferability when fine-tuned to other labeled datasets. However, such transferability becomes weak when the target dataset is fully unlabeled as in Unsupervised Domain Adaptation (UDA). We envision that the loss of transferability may stem from the intrinsic limitation of the architecture design of DNNs. In this paper, we delve into the components of DNN architectures and propose Transferable Normalization (TransNorm) in place of existing normalization techniques. TransNorm is an end-to-end trainable layer to make DNNs more transferable across domains. As a general method, TransNorm can be easily applied to various deep neural networks and domain adaption methods, without introducing any extra hyper-parameters or learnable parameters. Empirical results justify that TransNorm not only improves classification accuracies but also accelerates convergence for mainstream DNN-based domain adaptation methods.", "full_text": "Transferable Normalization: Towards Improving\n\nTransferability of Deep Neural Networks\n\nXimei Wang, Ying Jin, Mingsheng Long ((cid:66))\u2217, Jianmin Wang, and Michael I. Jordan(cid:93)\n\nSchool of Software, BNRist, Tsinghua University, China\nResearch Center for Big Data, Tsinghua University, China\nNational Engineering Laboratory for Big Data Software\n\n(cid:93)University of California, Berkeley, Berkeley, USA\n\n{wxm17,jiny18}@mails.tsinghua.edu.cn\n\n{mingsheng,jimwang}@tsinghua.edu.cn\n\njordan@cs.berkeley.edu\n\nAbstract\n\nDeep neural networks (DNNs) excel at learning representations when trained on\nlarge-scale datasets. 
Pre-trained DNNs also show strong transferability when \ufb01ne-\ntuned to other labeled datasets. However, such transferability becomes weak when\nthe target dataset is fully unlabeled as in Unsupervised Domain Adaptation (UDA).\nWe envision that the loss of transferability mainly stems from the intrinsic limitation\nof the architecture design of DNNs. In this paper, we delve into the components of\nDNN architectures and propose Transferable Normalization (TransNorm) in place\nof existing normalization techniques. TransNorm is an end-to-end trainable layer to\nmake DNNs more transferable across domains. As a general method, TransNorm\ncan be easily applied to various deep neural networks and domain adaption methods,\nwithout introducing any extra hyper-parameters or learnable parameters. Empirical\nresults justify that TransNorm not only improves classi\ufb01cation accuracies but also\naccelerates convergence for mainstream DNN-based domain adaptation methods.\n\nIntroduction\n\n1\nDeep neural networks (DNNs) have dramatically advanced the state of the art in machine learning and\nexcel at learning discriminative representations that are useful for a wide range of tasks [24, 3, 44].\nHowever, in real-world scenarios where gaining suf\ufb01cient labeled data through manual labeling is\nintolerably time-consuming and labor-exhausting, DNNs may be confronted with challenges when\ngeneralizing the pre-trained model to a different unlabeled dataset. This dilemma has motivated the\nresearch on domain adaptation [25], aiming to establish versatile algorithms that transfer knowledge\nfrom a different but related dataset by leveraging its readily-available labeled data.\nEarly domain adaptation methods strived to learn domain-invariant representations [25, 6] or instance\nimportances [11, 5] to bridge the source and target domains. 
Since deep neural networks (DNNs)\nbecame successful, domain adaptation methods also leveraged them for representation learning.\nDisentangling explanatory factors of variations, DNNs are able to learn more transferable features\n[24, 3, 44, 47]. Recent works in deep domain adaptation embed discrepancy measures into deep\narchitectures to align feature distributions across domains. Commonly, they minimize the distribution\ndiscrepancy between feature distributions [38, 17, 19, 20, 16] or adversarially learn transferable\nfeature representations to fool a domain discriminator in a two-player game [36, 4, 37, 26, 18, 41, 46].\nTo date there have been extensive works that tackle domain adaptation by reducing the domain shift\nfrom the perspective of loss function designs. They are admittedly quite effective but constitute only\n\n\u2217Corresponding author: Mingsheng Long (mingsheng@tsinghua.edu.cn)\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fone side of the coin. Rare attention has been paid to the other side of the coin, i.e. network design,\nwhich is fundamental to deep neural networks and also an indispensable part of domain adaptation.\nFew previous works have delved into the transferability of the involved network backbone, of the\ninternal components and most importantly, into the intrinsic limitation of these internal components.\nTo understand the impact of each internal component on the transferability of the whole deep networks,\nwe delve into ResNet [9], one of the most elegant backbones commonly used in domain adaptation. In\nResNet, convolutional layers contribute to feature transformation and abstraction, while pooling layers\ncontribute to enhancing the translation invariance. However, the invariance they achieve cannot bridge\nthe domain gap in domain adaptation since the domains are signi\ufb01cantly different [13]. 
By contrast,\nBatch Normalization (BN) is crucial for speeding up training as well as enhancing model performance\nby smoothing loss landscapes [34]. It normalizes inputs to be zero-mean and univariance and then\nscales and shifts the normalized vectors with two trainable parameters to maintain expressiveness.\nIntuitively, BN shares the spirit of statistics (moments) matching [17] with domain adaptation.\nHowever, when BN is applied to cross-domain scenarios, its design is not optimal for improving the\ntransferability of DNNs. At each BN layer, the representations from the source and target domains are\ncombined into a single batch and fed as the input of the BN layer. In this way, the suf\ufb01cient statistics\n(including mean and standard deviation) of the BN layer will be calculated on this combined batch,\nwithout considering to which domain these representations belong. In fact, owing to the phenomenon\nknown as dataset bias or domain shift [29], the distributions from source and target representations\nare signi\ufb01cantly different, which implies a signi\ufb01cant difference in their means and variances between\ndomains. Hence, a brute-force sharing of the mean and variance across domains as in the BN layers\nmay distort the feature distributions and thereby deteriorate the network transferability.\nFurther, the design that all channels in BN are equally important is unsuitable for domain adaptation. It\nis clear that some channels are easier to transfer than others since they constitute the sharing patterns of\nboth domains. While those more transferable channels should be highlighted, it remains unclear how\nthis can be implemented in an effective way, retaining the simplicity of BN while not changing other\ncomponents of network architectures. 
Apparently, the inability to adapt representations according to\nchannel transferability constitutes a major bottleneck in designing transferable architectures.\nThis paper delves into the components of DNN architectures and reveals that BN is the constraint\nof network transferability. We thus propose Transferable Normalization (TransNorm) in place of\nexisting normalization techniques. TransNorm is an end-to-end trainable layer to make DNNs more\ntransferable across domains. As a simple and general method, TransNorm can be easily plugged into\nvarious deep neural networks and domain adaption methods, free from changing the other network\nmodules and from introducing any extra hyper-parameters or learnable parameters. Empirical results\non \ufb01ve standard domain adaptation datasets justify that TransNorm not only improves classi\ufb01cation\naccuracies but also accelerates convergence for state of the art domain adaptation methods on DNNs.\n\n2 Related Work\nDomain Adaptation. Domain adaptation overcomes the shift across the source and target domains\n[25]. Common methods of domain adaptation fall into two classes: moment matching and adversarial\ntraining. Moment matching methods align feature distributions by minimizing the distribution discrep-\nancy in the feature space, where DAN [17] and DDC [38] employ Maximum Mean Discrepancy [8]\nand JAN [20] uses Joint Maximum Mean Discrepancy. Adversarial training becomes very successful\nsince the seminal work of Generative Adversarial Networks (GAN) [7], in which the generator and\ndiscriminator are improved through competing against each other in a two-player minimax game.\nInspired by the idea of adversarial learning [7], DANN [4] introduces the domain discriminator which\ndistinguishes the source and the target, while the feature extractor is trained to fool the discriminator.\nSince DANN was proposed, there were many attempts to further improve it. 
Motivated by Conditional\nGAN [23], CDAN [18] trains deep networks in a conditional domain-adversarial paradigm. MADA\n[26] extends DANN to multiple discriminators and pursues multimodal distribution alignment. ADDA\n[37] adopts asymmetric feature extractors for each domain while MCD [32] makes two classi\ufb01ers\nconsistent across domains. MDD [46] is the latest state of the art approach instantiating a new domain\nadaptation margin theory based on the hypothesis-induced discrepancy.\nNormalization Techniques. Normalization techniques are widely embedded in DNN architectures.\nAmong them, Batch Normalization (BN) [12] has the strong ability to accelerate and stabilize training\nby controlling the \ufb01rst and second moments of inputs. It transforms inputs of every layer to have\n\n2\n\n\fzero-mean and univariance and then scales and shifts them with two trainable parameters to uncover\nthe identity transform. Such a simple mechanism has been applied to most deep network architectures.\nAt \ufb01rst, BN was believed to stabilize training by reducing the Internal Covariate Shift (ICS), while\nrecent works proved that the ef\ufb01cacy of BN stems from smoother loss landscapes [34]. Meanwhile,\nBN has various variants, such as Layer Normalization [1] and Group Normalization [43].\nThere is no doubt that Batch Normalization is among the most successful innovations in deep neural\nnetworks, not only as a training method but also as a crucial component of the network backbone.\nHowever, in each BN layer, the representations from the source and target domains are combined\ninto a single batch to feed as the input of the BN layer. In this way, BN shares mean and variance\nacross two domains, which is clearly unreasonable due to domain shift. Researchers have realized this\nproblem, and proposed AdaBN [15] and AutoDIAL [22]. AdaBN uses target statistics at inference to\nreduce the domain shift, but these statistics are excluded from the training procedure. 
AutoDIAL\ninjects a linear combination of the source and target features to BN at the training stage, in which an\nextra parameter is involved in each BN layer as the trade-off between the source and target domains.\nFurther, there has been no work on modeling the channel-wise transferability in deep neural networks.\n\n3 Approach\nThis paper aims at improving transferability of deep neural networks in the context of Unsupervised\nDomain Adaptation (UDA). With a labeled source domain Ds = {(xs,i, ls,i)}ns\ni=1 and an unlabeled\ntarget domain Dt = {xt,i}nt\ni=1 where xi is an example and li is the associated label, UDA usually\ntrains a deep network that generalizes well to the unlabeled target data. The discussion in Section 1\nelaborates the need of developing a Transferable Normalization (TransNorm) layer for the network\nbackbone to enable domain adaptation, taking advantage of the moment matching mechanism in Batch\nNormalization (BN). The goal of TransNorm is to improve the transferability of deep networks with\nsimple and ef\ufb01cient network layers easily pluggable into the network backbones of existing domain\nadaptation methods, by simply replacing their internal BN layers with our TransNorm layers. To\nsmooth the presentation of TransNorm, we will \ufb01rst describe the preliminary of Batch Normalization.\n3.1 Batch Normalization (BN)\nMotivated by the wide practice that network training converges faster if its inputs are whitened [14, 42],\nBatch Normalization (BN) [12] was designed to transform the features of each layer to zero-mean and\nunivariance. In order to reduce computation cost, it is implemented with two necessary simpli\ufb01cations:\nnormalize each scalar feature (channel) independently in each mini-batch. 
First, it computes two basic statistics, the mean \mu^{(j)} and the variance \sigma^{2(j)}, for each channel j over a mini-batch of training data:

$$\mu^{(j)} = \frac{1}{m}\sum_{i=1}^{m} x_i^{(j)}, \qquad \sigma^{2(j)} = \frac{1}{m}\sum_{i=1}^{m}\bigl(x_i^{(j)} - \mu^{(j)}\bigr)^2, \tag{1}$$

where m is the batch size. Then the statistics are used to normalize the input of each layer to have zero-mean and univariance. After that, a pair of learnable parameters, gamma \gamma^{(j)} and beta \beta^{(j)}, are learned for each channel j to scale and shift the normalized value to uncover the identity transform:

$$\hat{x}^{(j)} = \frac{x^{(j)} - \mu^{(j)}}{\sqrt{\sigma^{2(j)} + \epsilon}}, \qquad y^{(j)} = \gamma^{(j)}\hat{x}^{(j)} + \beta^{(j)}, \tag{2}$$

where \epsilon is a constant added to the mini-batch variance for numerical stability. In this way, the Batch Normalizing Transform can successfully accelerate and stabilize training. However, it is somewhat unreasonable to combine the inputs of the source and target domains into a single batch since they follow different distributions. Furthermore, all channels are weighted with equal importance in BN, which is unsuitable for domain adaptation since some channels are more transferable than others.

3.2 Transferable Normalization (TransNorm)
Towards the aforementioned challenges when applying Batch Normalization in domain adaptation, we propose Transferable Normalization (TransNorm) to improve the transferability of deep neural networks. 
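As a concrete illustration, the per-channel Batch Normalizing Transform of Eqs. (1)-(2) can be sketched in a few lines of NumPy (a minimal sketch for exposition, not the paper's PyTorch implementation; it omits running statistics and gradients):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Per-channel Batch Normalizing Transform (Eqs. 1-2).

    x: (m, c) mini-batch of m examples with c channels.
    gamma, beta: (c,) learnable scale and shift parameters.
    """
    mu = x.mean(axis=0)                      # Eq. (1): mean per channel
    var = x.var(axis=0)                      # Eq. (1): (biased) variance per channel
    x_hat = (x - mu) / np.sqrt(var + eps)    # Eq. (2): zero-mean, unit-variance
    return gamma * x_hat + beta              # Eq. (2): scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

With \gamma = 1 and \beta = 0, the output of each channel has (near-)zero mean and unit variance regardless of the input distribution, which is exactly the moment-matching behavior discussed above.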
Following the logic of basic statistics, learnable parameters and intrinsic mechanism, we delve into BN step by step and replace the components that harm transferability.
Domain Specific Mean and Variance. In existing domain adaptation methods, the representations from the source and target domains are combined into a batch and fed as the input of each BN layer.

Figure 1: The architecture of Transferable Normalization (TransNorm).

However, due to the phenomenon known as dataset bias or domain shift [29], the representation distributions of the source and target domains are quite different, leading to a substantial difference between the source statistics \mu_s^{(j)}, \sigma_s^{2(j)} and the target statistics \mu_t^{(j)}, \sigma_t^{2(j)}. Hence, a brute-force sharing of the mean and variance across domains will deteriorate transferability, making BN, the only component for statistics matching that potentially boosts transferability, fail to reduce the domain shift.
We mitigate this limitation of BN by normalizing the representations from the source and target domains separately within each normalization layer. That is, the domain specific mean and variance of each channel are computed separately at first, and then in each layer we normalize the inputs independently for each domain as follows (let us focus on a particular activation x^{(j)} and omit j for clarity):

$$\hat{x}_s = \frac{x_s - \mu_s}{\sqrt{\sigma_s^2 + \epsilon}}, \qquad \hat{x}_t = \frac{x_t - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}}, \tag{3}$$

where \epsilon is a constant added to the mini-batch variance for numerical stability. This appears similar to AdaBN [15], which also calculates the mean and variance of the source and target domains separately; however, AdaBN only calculates the statistics of the source domain at training, and then substitutes those of the target domain at inference. 
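The domain-specific normalization of Eq. (3) amounts to applying the standard normalization to each domain's mini-batch with its own statistics. A minimal NumPy sketch (the helper name is ours, for illustration only):

```python
import numpy as np

def domain_specific_normalize(xs, xt, eps=1e-5):
    """Normalize source and target mini-batches with their own statistics (Eq. 3).

    xs, xt: (m, c) mini-batches from the source and target domains.
    """
    xs_hat = (xs - xs.mean(axis=0)) / np.sqrt(xs.var(axis=0) + eps)
    xt_hat = (xt - xt.mean(axis=0)) / np.sqrt(xt.var(axis=0) + eps)
    return xs_hat, xt_hat

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(32, 4))   # source features
xt = rng.normal(5.0, 3.0, size=(32, 4))   # target features under a large shift
xs_hat, xt_hat = domain_specific_normalize(xs, xt)
```

Even under a large shift in the target distribution, both normalized batches end up with zero mean and unit variance per channel, so the low-order domain shift is removed rather than averaged into shared statistics.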
Admittedly, it can partially mitigate the limitation of BN, but the statistics of the target domain, which carry important discriminative information, remain unexploited at training. By contrast, the statistics of both the source and target domains are applied in TransNorm at both training and inference, so it successfully captures the sufficient statistics of both domains.
Domain Sharing Gamma and Beta. With domain specific normalization, the inputs of the TransNorm layer for each channel are normalized to have zero-mean and univariance. Note that simply normalizing each channel of a layer for different domains may lead to a significant loss of expressiveness. Therefore, it is important to make sure that the transformation inserted in the neural network can uncover the identity transform, which is shown to be indispensable in BN [12]. Akin to BN, we reuse the pair of learnable parameters gamma (\gamma) and beta (\beta) to scale and shift the normalized values as follows:

$$z_s = \gamma\hat{x}_s + \beta, \qquad z_t = \gamma\hat{x}_t + \beta. \tag{4}$$

As the inputs of both domains are normalized to have zero-mean and univariance, the domain shift in the low-order statistics is partially reduced. Thus it becomes more reasonable to share the parameters gamma (\gamma) and beta (\beta) across domains. This appears similar to AutoDIAL [22], which also shares these two parameters across domains. However, AutoDIAL requires extra learnable parameters to align the source and target domains, while TransNorm requires no additional parameters.
Domain Adaptive Alpha. Both the domain specific mean and variance and the domain sharing gamma and beta process all channels uniformly. But it is intuitive that different channels capture different patterns. While some patterns are sharable across domains, others are not. Hence, each channel may embody different transferability across domains. A natural question arises: can we quantify the transferability of different channels and adapt them differently across domains? This paper gives a positive answer. We reuse the domain-specific statistics of means and variances before normalization to select the more transferable channels. These statistics, with channel-wise distribution information, enable us to quantify the domain distance d^{(j)} for each channel j in a parameter-free paradigm:

$$d^{(j)} = \left|\frac{\mu_s^{(j)}}{\sqrt{\sigma_s^{2(j)} + \epsilon}} - \frac{\mu_t^{(j)}}{\sqrt{\sigma_t^{2(j)} + \epsilon}}\right|, \quad j = 1, 2, \ldots, c, \tag{5}$$

where c denotes the number of channels in the layer that TransNorm applies to. Reusing the previously calculated statistics, this computation incurs a negligible cost.
While the channel-wise distance d^{(j)} naturally quantifies the inverse of transferability, we need to convert it into a probabilistic weighting scheme. Motivated by the well-established t-SNE [21], we use the Student t-distribution [35] with one degree of freedom to convert the computed distances to probabilities. This distribution, with heavier tails than alternatives such as the Gaussian distribution, is suitable for highlighting transferable channels while avoiding overly penalizing the others. We compute the distance-based probability alpha (\alpha) to adapt each channel according to its transferability:

$$\alpha^{(j)} = \frac{c\,(1 + d^{(j)})^{-1}}{\sum_{k=1}^{c} (1 + d^{(k)})^{-1}}, \quad j = 1, 2, \ldots, c, \tag{6}$$

where c is the channel size. In this way, the statistics computed before normalization are reused to quantify channel transferability, and finally, the channels with higher transferability are assigned higher importance for domain adaptation, enabling better learning of transferable patterns. To further avoid overly penalizing the informative channels, we use a residual connection to combine the normalized values in Eq. (4) with the transferability-weighted ones to form the final output of the TransNorm layer:

$$y_s = (1 + \alpha) z_s, \qquad y_t = (1 + \alpha) z_t. \tag{7}$$

TransNorm Layer. The overall process of TransNorm is summarized in Algorithm 1. Integrating the above components, TransNorm is designed to be an end-to-end trainable layer that improves the transferability of deep neural networks. As shown in Figure 1, TransNorm calculates the statistics of inputs from the source and target domains separately while the channel transferability is computed in parallel. After normalization, features go through a channel adaptive mechanism that re-weights channels according to their transferability. It is crucial that the channel transferability be evaluated with statistics of the layer input and assigned to the normalized features. In addition, the residual mechanism [9] is also included in TransNorm to make it more robust to subtle channels.

Algorithm 1 Transferable Normalization (TransNorm)
Input: values of x in a mini-batch from the source domain D_s = {x_{s,i}}_{i=1}^{m} and the target domain D_t = {x_{t,i}}_{i=1}^{m}, the size m of the mini-batch and the number c of channels in each layer.
Parameters shared across domains to be learned: \gamma, \beta.
Output: {y_{s,i} = TransNorm_{\gamma,\beta}(x_{s,i})}, {y_{t,i} = TransNorm_{\gamma,\beta}(x_{t,i})}.
  \mu_s \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_{s,i},  \mu_t \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_{t,i}    // mini-batch means of the source and target domains
  \sigma_s^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_{s,i} - \mu_s)^2,  \sigma_t^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_{t,i} - \mu_t)^2    // mini-batch variances of the two domains
  d^{(j)} \leftarrow \left|\mu_s^{(j)}/\sqrt{\sigma_s^{2(j)}+\epsilon} - \mu_t^{(j)}/\sqrt{\sigma_t^{2(j)}+\epsilon}\right|,  j = 1, 2, ..., c    // statistics discrepancy for each channel
  \alpha^{(j)} \leftarrow c(1+d^{(j)})^{-1} / \sum_{k=1}^{c}(1+d^{(k)})^{-1},  j = 1, 2, ..., c    // transferability value for each channel
  \hat{x}_s \leftarrow (x_s - \mu_s)/\sqrt{\sigma_s^2+\epsilon},  \hat{x}_t \leftarrow (x_t - \mu_t)/\sqrt{\sigma_t^2+\epsilon}    // normalize the source and target domains separately
  y_{s,i} \leftarrow (1+\alpha)(\gamma\hat{x}_{s,i} + \beta) \equiv TransNorm_{\gamma,\beta}(x_{s,i})    // scale, shift and adapt in the source domain
  y_{t,i} \leftarrow (1+\alpha)(\gamma\hat{x}_{t,i} + \beta) \equiv TransNorm_{\gamma,\beta}(x_{t,i})    // scale, shift and adapt in the target domain

(a) CNN    (b) Residual Block    (c) Inception Block
Figure 2: Improving transferability of mainstream deep backbones by replacing BN with TransNorm.

3.3 Improving Transferability of Deep Neural Networks
As a general and lightweight normalization layer, TransNorm can be easily plugged into common network backbones in place of the standard batch normalization, whether they are pre-trained or trained from scratch. 
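The per-channel computation of Algorithm 1 can be sketched end to end in NumPy (a compact illustration under our own naming, not the authors' released PyTorch code; running statistics and convolutional feature maps are omitted):

```python
import numpy as np

def transnorm(xs, xt, gamma, beta, eps=1e-5):
    """Sketch of one TransNorm layer (Algorithm 1) for 2-D inputs of shape (m, c).

    xs, xt: source / target mini-batches; gamma, beta: shared scale and shift.
    """
    mu_s, var_s = xs.mean(axis=0), xs.var(axis=0)   # domain-specific statistics
    mu_t, var_t = xt.mean(axis=0), xt.var(axis=0)
    # Channel-wise discrepancy between normalized domain means (Eq. 5).
    d = np.abs(mu_s / np.sqrt(var_s + eps) - mu_t / np.sqrt(var_t + eps))
    # Student-t style probabilities: smaller distance -> larger alpha (Eq. 6).
    c = xs.shape[1]
    alpha = c * (1.0 + d) ** -1 / np.sum((1.0 + d) ** -1)
    # Normalize each domain with its own statistics (Eq. 3), shared gamma/beta (Eq. 4).
    zs = gamma * (xs - mu_s) / np.sqrt(var_s + eps) + beta
    zt = gamma * (xt - mu_t) / np.sqrt(var_t + eps) + beta
    # Residual channel re-weighting (Eq. 7).
    return (1.0 + alpha) * zs, (1.0 + alpha) * zt

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(64, 16))
xt = rng.normal(2.0, 3.0, size=(64, 16))   # shifted target domain
ys, yt = transnorm(xs, xt, gamma=np.ones(16), beta=np.zeros(16))
```

Note that the weights \alpha^{(j)} sum to c by construction, so in the degenerate case of identical domain statistics every channel receives \alpha^{(j)} = 1 and the residual output reduces to twice the shared-parameter normalization.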
With stronger network backbones, most domain adaptation methods can be improved significantly without changing their loss functions. As shown in Figure 2, we can apply TransNorm to different kinds of deep neural networks, from vanilla CNNs to ResNet to Inception Net, without any other modifications to the network architecture or the introduction of any additional parameters.

4 Experiments
We evaluate TransNorm with two series of experiments: (a) basic experiments applying TransNorm to the ResNet-50 backbone for the seminal domain adaptation methods DANN [4] and CDAN [18] on three standard datasets; and (b) generalization to a wider variety of domain adaptation methods, more kinds of network backbones, and more challenging datasets. Experimental results indicate that TransNorm unanimously improves the transferability of state-of-the-art domain adaptation methods over five standard datasets. The code of TransNorm is available at http://github.com/thuml/TransNorm.

4.1 Setup
We use five domain adaptation datasets: Office-31 [31] with 31 categories and 4,652 images collected from three domains: Amazon (A), DSLR (D) and Webcam (W); ImageCLEF-DA with 12 classes shared by three public datasets (domains): Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P); Office-Home [39] with 65 classes and 15,500 images from four significantly different domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr) and Real-World images (Rw); the Digits datasets with images from MNIST (M) and Street View House Numbers (SVHN, S); and VisDA-2017 [27], a simulation-to-real dataset involving over 280K images from 12 categories. Our methods were implemented in PyTorch, employing ResNet-50 [9] models pre-trained on the ImageNet dataset [30]. 
We follow the standard protocols for unsupervised domain adaptation[4, 20].\nWe conduct Deep Embedded Validation (DEV) [45] to select the hyper-parameters for all methods.\n\nTable 1: Accuracy (%) on Of\ufb01ce-31 for unsupervised domain adaption (ResNet-50).\nW\u2192A\n\nD\u2192A\n\nA\u2192D\n\nA\u2192W\nD\u2192W\n68.4 \u00b1 0.2 96.7 \u00b1 0.1\n80.5 \u00b1 0.4 97.1 \u00b1 0.2\n84.5 \u00b1 0.2 96.8 \u00b1 0.1\n86.2 \u00b1 0.5 96.2 \u00b1 0.3\n85.4 \u00b1 0.3 97.4 \u00b1 0.2\n90.0 \u00b1 0.1 97.4 \u00b1 0.1\n88.6 \u00b1 0.5 98.2 \u00b1 0.2\n89.5 \u00b1 0.5 97.9 \u00b1 0.3\n82.0 \u00b1 0.4 96.9 \u00b1 0.2\n\nMethod\nAvg\n68.9 \u00b1 0.2 62.5 \u00b1 0.3 60.7 \u00b1 0.3 76.1\nResNet-50 [9]\n78.6 \u00b1 0.2 63.6 \u00b1 0.3 62.8 \u00b1 0.2 80.4\nDAN [17]\n77.5 \u00b1 0.3 66.2 \u00b1 0.2 64.8 \u00b1 0.3 81.6\nRTN [19]\n77.8 \u00b1 0.3 69.5 \u00b1 0.4 68.9 \u00b1 0.5 82.9\nADDA [37]\n84.7 \u00b1 0.3 68.6 \u00b1 0.3 70.0 \u00b1 0.4 84.3\nJAN [20]\n87.8 \u00b1 0.2 70.3 \u00b1 0.3 66.4 \u00b1 0.3 85.2\nMADA [26]\n85.3 \u00b1 0.3 73.4 \u00b1 0.8 71.6 \u00b1 0.6 86.2\nSimNet [28]\n87.7 \u00b1 0.5 72.8 \u00b1 0.3 71.4 \u00b1 0.4 86.5\nGTA [33]\n79.7 \u00b1 0.4 68.2 \u00b1 0.4 67.4 \u00b1 0.5 82.2\nDANN+BN [4]\nDANN+TransNorm 91.8 \u00b1 0.4 97.7 \u00b1 0.2 100.0 \u00b1 0.0 88.0 \u00b1 0.2 68.2 \u00b1 0.2 70.4 \u00b1 0.3 86.0\n94.1 \u00b1 0.1 98.6 \u00b1 0.1 100.0 \u00b1 0.0 92.9 \u00b1 0.2 71.0 \u00b1 0.3 69.3 \u00b1 0.3 87.7\nCDAN+BN [18]\nCDAN+TransNorm 95.7 \u00b1 0.3 98.7 \u00b1 0.2 100.0 \u00b1 0.0 94.0 \u00b1 0.2 73.4 \u00b1 0.4 74.2 \u00b1 0.3 89.3\n\nW\u2192D\n99.3 \u00b1 0.1\n99.6 \u00b1 0.1\n99.4 \u00b1 0.1\n98.4 \u00b1 0.3\n99.8 \u00b1 0.2\n99.6 \u00b1 0.1\n99.7 \u00b1 0.2\n99.8 \u00b1 0.4\n99.1 \u00b1 
0.1\n\n6\n\nConv1\u00d71,64reluBatchNormConv3\u00d73,64BatchNormConv1\u00d71,256BatchNormrelu\u2a01reluxIdentity\u2131(x)\u2131x+xConv5\u00d75,64dropout,reluTransNormConv5\u00d75,128TransNormConv5\u00d75,256TransNormx\u2131(x)dropout,reludropout,reluConv1\u00d71,64reluTransNormConv3\u00d73,64TransNormConv1\u00d71,256TransNormrelu\u2a01reluxIdentity\u2131(x)\u2131x+xPreviousLayerTransNormTransNormTransNormConv1\u00d71TransNormTransNormTransNormTransNormFilterConcatenationConv1\u00d71Conv3\u00d73Conv1\u00d71Conv3\u00d73Conv3\u00d73Conv1\u00d71Pool3\u00d73\fTable 2: Accuracy (%) on Image-CLEF for unsupervised domain adaption (ResNet-50).\n\nI\u2192P\n\nMethod\n74.8 \u00b1 0.3\nResNet-50 [9]\n74.5 \u00b1 0.4\nDAN [17]\n75.0 \u00b1 0.3\nDANN+BN [4]\nDANN+TransNorm 78.2 \u00b1 0.4\n77.7 \u00b1 0.3\nCDAN+BN [18]\nCDAN+TransNorm 78.3 \u00b1 0.3\n\nP\u2192I\n\n83.9 \u00b1 0.1\n82.2 \u00b1 0.2\n86.0 \u00b1 0.3\n89.5\u00b1 0.2\n90.7 \u00b1 0.2\n90.8 \u00b1 0.2\n\nI\u2192C\n\n91.5 \u00b1 0.3\n92.8 \u00b1 0.2\n96.2 \u00b1 0.4\n95.5 \u00b1 0.2\n97.7 \u00b1 0.3\n96.7 \u00b1 0.4\n\nC\u2192I\n\n78.0 \u00b1 0.2\n86.3 \u00b1 0.4\n87.0 \u00b1 0.5\n91.0 \u00b1 0.2\n91.3 \u00b1 0.3\n92.3 \u00b1 0.2\n\nC\u2192P\n\n65.5 \u00b1 0.3\n69.2 \u00b1 0.4\n74.3 \u00b1 0.5\n76.0 \u00b1 0.2\n74.2 \u00b1 0.2\n78.0 \u00b1 0.1\n\nP\u2192C\n\n91.2 \u00b1 0.3\n89.8 \u00b1 0.4\n91.5 \u00b1 0.6\n91.5 \u00b1 0.3\n94.3 \u00b1 0.3\n94.8 \u00b1 0.3\n\nAvg\n80.7\n82.5\n85.0\n87.0\n87.7\n88.5\n\nTable 3: Accuracy (%) on Of\ufb01ce-Home for unsupervised domain adaption (ResNet-50).\n\nAr:Cl Ar:Pr Ar:Rw Cl:Ar Cl:Pr Cl:Rw Pr:Ar Pr:Cl Pr:Rw Rw:Ar Rw:Cl Rw:Pr Avg\nMethod (S:T)\n46.1\n34.9 50.0 58.0\nResNet-50 [9]\n59.9\n41.2\n43.6 57.0 67.9\n56.3\nDAN [17]\n74.3\n51.5\n58.3\n45.9 61.2 68.9\nJAN [20]\n76.8\n52.4\nDANN+BN [4]\n45.6 59.3 70.1\n57.6\n51.8\n76.8\n59.3\nDANN+TransNorm 43.5 60.9 72.1\n52.2 77.9\nCDAN+BN [18]\n50.7 70.6 76.0\n65.8\n56.7\n81.6\n82.9 67.6\n59.0\nCDAN+TransNorm 50.2 71.4 77.4\n\n53.9\n38.5 
31.2 60.4\n37.4 41.9 46.2\n63.1\n44.0 43.6 67.7\n45.8 56.5 60.4\n63.9\n45.8 43.4 70.3\n50.4 59.7 61.0\n63.2\n47.0 58.5 60.9\n46.1 43.7 68.5\n63.7\n51.0 61.5 62.5 49.6 46.8 70.4\n70.9\n57.6 70.0 70.0\n57.4 50.9 77.3\n61.0 53.1 79.5 71.9\n59.3 72.7 73.1\n\n4.2 Basic Results\nOf\ufb01ce-31. As Table 1 shows, TransNorm enables DANN to surpass methods with more complicated\nnetworks and loss designs. Meanwhile, TransNorm can further improve CDAN, a competitive method.\nImageCLEF-DA. Table 2 reports the results on ImageCLEF-DA. TransNorm strengthens domain\nadaptation methods in most tasks, though the improvement is smaller due to more similar domains.\nOf\ufb01ce-Home. Table 3 also demonstrates signi\ufb01cantly improved the results of TransNorm on Of\ufb01ce-\nHome. This dataset is much more complicated, showing the ef\ufb01cacy of TransNorm on hard problems.\n\nTable 4: Accuracy (%) on more adaptation methods, network backbones, and datasets.\n\nAdaptation Method\n\nNetwork Backbone\n\nDataset\n\nMethod\nDANN [4]\nCDAN [18]\n\nAvg Method (Inception) Avg Method (DTN)\n75.5 CyCADA [10]\n82.2 Inception-BN [12]\n87.7 AdaBN [15]\n76.7 MCD [32]\n82.1 DANN+BN [4]\n88.8 DANN+BN [4]\nMDD+BN [46]\nMDD+TransNorm 89.5 CDAN+BN [18]\n\nVisDA\n69.5\n69.7\n63.7\nDANN+TransNorm 83.2 DANN+TransNorm 77.0 DANN+TransNorm 66.3\n70.0\nCDAN+TransNorm 86.0 CDAN+TransNorm 94.7 CDAN+TransNorm 71.4\n\nS\u2192M Method\n90.4 GTA [33]\n94.2 MCD [32]\n73.9 DANN+BN [4]\n\n84.9 CDAN+BN [18]\n\n89.2 CDAN+BN [18]\n\n4.3 Generalized Results\nMore Adaptation Methods. TransNorm can be applied to more sophisticated domain adaptation\nmethods. To show this, we further evaluate it on MDD [46], the latest state of the art method. Results\non Of\ufb01ce-31 dataset in Table 4 (left column) indicate that TransNorm can further improve MDD.\nMore Network Backbones. We evaluate another two network backbones: Inception-BN and a\nsmaller CNN network to ensure whether TransNorm can work well when trained from scratch. 
As shown in Table 4 (middle column), DANN and CDAN are both clearly improved. Enhanced by TransNorm, CDAN outperforms all other more complicated methods on the difficult digit task (S→M).
More Challenging Datasets. We apply TransNorm with domain adaptation methods on the large-scale VisDA-2017 dataset. As shown in Table 4 (right column), TransNorm is consistently effective.
More Normalization Methods. We thoroughly compare with other normalization techniques on DANN [4] and CDAN [18], respectively. As shown in Table 5, TransNorm consistently outperforms the two related methods, based either on DANN or CDAN. Further, since AdaBN does not exploit the discriminative statistics of the target domain at training time, CDAN+AdaBN performs worse than CDAN.

Table 5: Accuracy comparison with BN, AdaBN and AutoDIAL (%) on Office-31 (ResNet-50).

Method            DANN [4]                                CDAN [18]
Normalization     BN      AdaBN   AutoDIAL  TransNorm     BN      AdaBN   AutoDIAL  TransNorm
A→W               82.0    82.4    84.8      91.8          94.1    88.8    92.3      95.7
D→W               96.9    97.7    97.7      97.7          98.6    98.6    98.6      98.7
W→D               99.1    99.8    100.0     100.0         100.0   100.0   100.0     100.0
A→D               79.7    81.0    85.7      88.0          92.9    92.7    93.0      94.0
D→A               68.2    67.2    63.9      68.2          71.5    70.8    71.5      73.4
W→A               67.4    68.2    68.7      70.4          69.3    70.0    72.2      74.2
Avg               82.2    82.7    83.5      86.0          87.7    86.8    87.9      89.3

4.4 Insight Analyses
Ablation Study. The domain-adaptive α highlights more transferable channels by computing channel-wise transferability in two steps: calculating distance in Eq. (5) and generating probability in Eq. (6). Calculating the distance only between means µ cannot capture the variance σ, while the Wasserstein distance [40] uses both µ and σ, as ‖µs − µt‖² + (σs² + σt² − 2σsσt), in which the relative impacts of µ and σ are not balanced. Our strategy calculates the distance between means normalized by variance, µ/√(σ² + ε), which is consistent with BatchNorm [12] and yields the best results, as shown in Table 6. Using the same distance µ/√(σ² + ε), we further examine the design of generating probability. For a distance-based probability that quantifies the transferability of each channel, Softmax's winner-takes-all strategy is not suitable, while the Gaussian's tail is not as heavy as Student-t's. Only the Student-t distribution has tails heavy enough to highlight transferable channels while avoiding overly penalizing the others, as supported by the results in Table 6.

Table 6: Ablation study (%) on Office-31 with CDAN+TransNorm (ResNet-50).

Type        Weight    Distance Type                        Probability Type
Ablation    α = 1     Wasserstein   µ       µ/√(σ²+ε)      Softmax   Gaussian   Student-t
A→W         94.6      94.5          95.0    95.7           94.9      94.8       95.7
D→W         98.6      98.7          98.7    98.7           97.9      98.6       98.7
W→D         100.0     100.0         100.0   100.0          99.8      100.0      100.0
A→D         93.4      91.0          93.0    94.0           91.4      93.1       94.0
D→A         71.5      72.6          72.8    73.4           69.7      72.4       73.4
W→A         72.9      73.6          73.7    74.2           73.2      73.0       74.2
Avg         88.5      88.4          88.9    89.3           87.9      88.7       89.3

Further, to verify that domain-specific mean and variance are the most reasonable design, we set α to 1 in TransNorm. As shown in Table 6, the results of TransNorm (α = 1) are better than those of AutoDIAL, indicating that domain-specific mean and variance are better than the domain-mixed ones adopted by AutoDIAL.
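The two-step computation of the channel transferability α described above can be sketched as follows. This is a minimal NumPy illustration under our own naming; the (1 + d)⁻¹ weighting is one Student-t-style heavy-tailed kernel consistent with the description of Eq. (5) and Eq. (6), not the paper's reference implementation, and the ε default is illustrative.

```python
import numpy as np

def channel_transferability(mu_s, var_s, mu_t, var_t, eps=1e-5):
    """Channel-wise transferability alpha, sketched in two steps:
    (1) distance between variance-normalized means (Eq. 5 style),
    (2) a heavy-tailed Student-t style probability over channels (Eq. 6 style).
    All names here are illustrative, not from the paper's code."""
    # Step 1: distance between means normalized by variance, mu / sqrt(var + eps),
    # consistent with how BatchNorm standardizes activations
    d = np.abs(mu_s / np.sqrt(var_s + eps) - mu_t / np.sqrt(var_t + eps))
    # Step 2: heavy-tailed weighting; channels with smaller cross-domain
    # distance get larger alpha. Scaling by C keeps the mean of alpha at 1.
    C = d.shape[0]
    w = 1.0 / (1.0 + d)
    return C * w / w.sum()
```

In this sketch, a channel whose source and target statistics agree (small d) receives α above 1, while a channel with a large statistics gap is down-weighted, matching the "highlight transferable channels without overly penalizing the others" behavior attributed to the Student-t choice.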
AutoDIAL aims at learning a domain-mixing parameter, but its learnable parameter α converges to α ≈ 1 (implying domain-specific statistics) as the network becomes deeper, as shown in Figure 3(b) (ResNet-50), consistent with Figure 3 (Inception-BN) of its original paper [22].

(a) Convergence  (b) α in AutoDIAL  (c) A-distance  (d) λ

Figure 3: Visualization of convergence speed, α in AutoDIAL [22], A-distance, and λ.

Convergence Speed. TransNorm also accelerates convergence, as shown in Figure 3(a). Taking CDAN on A→W as an example, CDAN with TransNorm needs only 1,500 iterations to reach 95% accuracy, while at that point the accuracy of CDAN with BN is still well below 90%.

Theoretical Analysis. Ben-David et al. [2] derived the learning bound of domain adaptation as E_T(h) ≤ E_S(h) + (1/2) d_H∆H(S, T) + λ, which bounds the expected error E_T(h) of a hypothesis h on the target domain by three terms: (a) the expected error of h on the source domain, E_S(h); (b) the A-distance d_H∆H(S, T) = 2(1 − 2ε), a measure of domain discrepancy, where ε is the error rate of a domain classifier trained to discriminate the source domain from the target domain; and (c) the error λ of the ideal joint hypothesis h* on both source and target domains.
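The A-distance term above can be estimated empirically from features of the two domains. The sketch below assumes pre-extracted feature matrices; the nearest-centroid classifier is a deliberately simple stand-in for the trained domain discriminator, and both function names are ours.

```python
import numpy as np

def proxy_a_distance(err):
    """A-distance from the error of a source-vs-target domain classifier:
    d_A = 2 * (1 - 2 * err). err = 0.5 (chance) gives 0; err = 0 gives 2."""
    return 2.0 * (1.0 - 2.0 * err)

def domain_classifier_error(feat_s, feat_t):
    """Balanced error of a nearest-centroid domain classifier, a simplified
    stand-in for the trained domain discriminator described in the text."""
    c_s, c_t = feat_s.mean(axis=0), feat_t.mean(axis=0)

    def predict_target(x):  # True where a sample is classified as target
        return np.linalg.norm(x - c_t, axis=1) < np.linalg.norm(x - c_s, axis=1)

    err_s = predict_target(feat_s).mean()        # source mistaken for target
    err_t = 1.0 - predict_target(feat_t).mean()  # target mistaken for source
    return 0.5 * (err_s + err_t)
```

Well-aligned features make the domain classifier no better than chance (err ≈ 0.5), driving the estimated A-distance toward 0, which is the sense in which a lower A-distance indicates more transferable features.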
By calculating the A-distance and λ on the transfer task A→W with DANN, we find that TransNorm achieves both a lower A-distance and a lower λ (Figure 3), implying a lower generalization error with TransNorm.

Figure 4: Visualization of channels in the first convolutional layer of CDAN (TransNorm). Source domain: Art; target domain: Real World. In each group — left: the original input image; middle: the feature map of the selected channel; right: the overlay of the two. Shown are the two most transferable channels (#05, α = 1.09; #39, α = 1.08) and the two least transferable (#32, α = 0.91; #48, α = 0.90).

Channel Visualization. Though α in Eq. (6) is shared across domains, it is specific to each channel so as to highlight transferable channels. Here we calculate the transferability α of each channel after training converges. The channels with the largest and smallest transferability are shown in Figure 4. Channels with higher transferability capture more shared patterns, such as object textures, while those with lower transferability focus on the background or varying patterns.

(a) DANN (BN)  (b) DANN (TransNorm)  (c) CDAN (BN)  (d) CDAN (TransNorm)

Figure 5: Visualization of features from the models with BN and TransNorm (red: A; blue: D).

Feature Visualization. We use t-SNE [21] to visualize the representations in the bottleneck layer of ResNet-50, comparing the models with BN and with TransNorm, based on DANN and CDAN on task A→D of Office-31. As shown in Figure 5, the source and target features of the models with TransNorm are less distinguishable, implying that TransNorm learns more transferable features. Meanwhile, class boundaries are sharper, indicating that TransNorm also learns more discriminative features.
5 Conclusion
We presented TransNorm, a simple, general, and end-to-end trainable normalization layer for improving the transferability of deep neural networks. It can be applied to a variety of network backbones by simply replacing batch normalization.
Empirical studies show that TransNorm significantly boosts existing domain adaptation methods, yielding higher accuracy and faster convergence.
Acknowledgments
This work was supported by the National Key R&D Program of China (2017YFC1502003) and the Natural Science Foundation of China (61772299 and 71690231).
References
[1] L. J. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. In NeurIPS, 2016.
[2] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[4] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky. Domain-adversarial training of neural networks. JMLR, 17:59:1–59:35, 2016.
[5] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, 2013.
[6] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[8] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NeurIPS, 2012.
[9] K. He, X. Zhang, S.
Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
[11] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NeurIPS, 2006.
[12] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[13] S. Kornblith, J. Shlens, and Q. V. Le. Do better ImageNet models transfer better? In CVPR, 2019.
[14] Y. LeCun, L. Bottou, G. B. Orr, and K. R. Müller. Efficient BackProp, pages 9–50. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998.
[15] Y. Li, N. Wang, J. Shi, X. Hou, and J. Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117, 2018.
[16] M. Long, Y. Cao, Z. Cao, J. Wang, and M. I. Jordan. Transferable representation learning with deep adaptation networks. TPAMI, pages 1–1, 2018.
[17] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
[18] M. Long, Z. Cao, J. Wang, and M. I. Jordan. Conditional adversarial domain adaptation. In NeurIPS, 2018.
[19] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NeurIPS, 2016.
[20] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
[21] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.
[22] F. Maria Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. Rota Bulò. AutoDIAL: Automatic domain alignment layers. In ICCV, 2017.
[23] M. Mirza and S. Osindero. Conditional generative adversarial nets.
CoRR, abs/1411.1784, 2014.
[24] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[25] S. J. Pan and Q. Yang. A survey on transfer learning. TKDE, 22(10):1345–1359, 2010.
[26] Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial domain adaptation. In AAAI, 2018.
[27] X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko. VisDA: The visual domain adaptation challenge. CoRR, abs/1710.06924, 2017.
[28] P. O. Pinheiro. Unsupervised domain adaptation with similarity learning. In CVPR, 2018.
[29] J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. 2014.
[31] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[32] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
[33] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, and R. Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In CVPR, 2018.
[34] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In NeurIPS, 2018.
[35] Student. The probable error of a mean. Biometrika, pages 1–25, 1908.
[36] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
[37] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
[38] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014.
[39] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.
[40] C. Villani. The Wasserstein distances. In Optimal Transport: Old and New, pages 93–111. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[41] X. Wang, L. Li, W. Ye, M. Long, and J. Wang. Transferable attention for domain adaptation. In AAAI, 2019.
[42] S. Wiesler and H. Ney. A convergence analysis of log-linear training. In NeurIPS, 2011.
[43] Y. Wu and K. He. Group normalization. In ECCV, 2018.
[44] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NeurIPS, 2014.
[45] K. You, X. Wang, M. Long, and M. Jordan. Towards accurate model selection in deep unsupervised domain adaptation. In ICML, 2019.
[46] Y. Zhang, T. Liu, M. Long, and M. Jordan. Bridging theory and algorithm for domain adaptation. In ICML, 2019.
[47] H. Zhao, R. T. des Combes, K. Zhang, and G. J. Gordon. On learning invariant representation for domain adaptation. In ICML, 2019.