{"title": "Unsupervised Domain Adaptation with Residual Transfer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 136, "page_last": 144, "abstract": "The recent success of deep neural networks relies on massive amounts of labeled data. For a target task where labeled data is unavailable, domain adaptation can transfer a learner from a different source domain. In this paper, we propose a new approach to domain adaptation in deep networks that can jointly learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain. We relax a shared-classifier assumption made by previous methods and assume that the source classifier and target classifier differ by a residual function. We enable classifier adaptation by plugging several layers into deep network to explicitly learn the residual function with reference to the target classifier. We fuse features of multiple layers with tensor product and embed them into reproducing kernel Hilbert spaces to match distributions for feature adaptation. The adaptation can be achieved in most feed-forward models by extending them with new residual layers and loss functions, which can be trained efficiently via back-propagation. Empirical evidence shows that the new approach outperforms state of the art methods on standard domain adaptation benchmarks.", "full_text": "Unsupervised Domain Adaptation with Residual\n\nTransfer Networks\n\nMingsheng Long\u2020, Han Zhu\u2020, Jianmin Wang\u2020, and Michael I. Jordan(cid:93)\n\u2020KLiss, MOE; TNList; School of Software, Tsinghua University, China\n\n(cid:93)University of California, Berkeley, Berkeley, USA\n\n{mingsheng,jimwang}@tsinghua.edu.cn, zhuhan10@gmail.com, jordan@berkeley.edu\n\nAbstract\n\nThe recent success of deep neural networks relies on massive amounts of labeled\ndata. 
For a target task where labeled data is unavailable, domain adaptation can transfer a learner from a different source domain. In this paper, we propose a new approach to domain adaptation in deep networks that can jointly learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain. We relax a shared-classifier assumption made by previous methods and assume that the source classifier and target classifier differ by a residual function. We enable classifier adaptation by plugging several layers into deep networks to explicitly learn the residual function with reference to the target classifier. We fuse features of multiple layers with the tensor product and embed them into reproducing kernel Hilbert spaces to match distributions for feature adaptation. The adaptation can be achieved in most feed-forward models by extending them with new residual layers and loss functions, which can be trained efficiently via back-propagation. Empirical evidence shows that the new approach outperforms state-of-the-art methods on standard domain adaptation benchmarks.

1 Introduction

Deep networks have significantly improved the state of the art for a wide variety of machine-learning problems and applications. Unfortunately, these impressive gains in performance come only when massive amounts of labeled data are available for supervised training. Since manual labeling of sufficient training data for diverse application domains on-the-fly is often prohibitive, for problems short of labeled data there is a strong incentive to establish effective algorithms that reduce the labeling cost, typically by leveraging off-the-shelf labeled data from a different but related source domain.
However, this learning paradigm suffers from the shift in data distributions across different domains, which poses a major obstacle in adapting predictive models for the target task [1].

Domain adaptation [1] is machine learning under the shift between training and test distributions. A rich line of approaches to domain adaptation aim to bridge the source and target domains by learning domain-invariant feature representations without using target labels, so that the classifier learned from the source domain can be applied to the target domain. Recent studies have shown that deep networks can learn more transferable features for domain adaptation [2, 3], by disentangling explanatory factors of variations behind domains. The latest advances have been achieved by embedding domain adaptation in the pipeline of deep feature learning, which can extract domain-invariant representations [4, 5, 6, 7].

The previous deep domain adaptation methods work under the assumption that the source classifier can be directly transferred to the target domain upon the learned domain-invariant feature representations. This assumption is rather strong, as in practical applications it is often infeasible to check whether the source classifier and target classifier can be shared or not. Hence we focus in this paper on a more general, and safe, domain adaptation scenario in which the source classifier and target classifier differ by a small perturbation function.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
The goal of this paper is to simultaneously learn adaptive classifiers and transferable features from labeled data in the source domain and unlabeled data in the target domain by embedding the adaptations of both classifiers and features in a unified deep architecture.

Motivated by state-of-the-art deep residual learning [8], winner of the ImageNet ILSVRC 2015 challenge, we propose a new Residual Transfer Network (RTN) approach to domain adaptation in deep networks which can simultaneously learn adaptive classifiers and transferable features. We relax the shared-classifier assumption made by previous methods and assume that the source and target classifiers differ by a small residual function. We enable classifier adaptation by plugging several layers into deep networks to explicitly learn the residual function with reference to the target classifier. In this way, the source classifier and target classifier can be bridged tightly in the back-propagation procedure. The target classifier is tailored better to the target data by exploiting the low-density separation criterion. We fuse features of multiple layers with the tensor product and embed them into reproducing kernel Hilbert spaces to match distributions for feature adaptation. The adaptation can be achieved in most feed-forward models by extending them with new residual layers and loss functions, and can be trained efficiently using standard back-propagation. Extensive evidence suggests that the RTN approach outperforms several state-of-the-art methods on standard domain adaptation benchmarks.

2 Related Work

Domain adaptation [1] builds models that can bridge different domains or tasks, which mitigates the burden of manual labeling for machine learning [9, 10, 11, 12], computer vision [13, 14, 15] and natural language processing [16].
The main technical problem of domain adaptation is to formally reduce the discrepancy between the probability distributions of different domains. Deep neural networks can learn abstract representations that disentangle different explanatory factors of variations behind data samples [17] and manifest invariant factors underlying different populations that transfer well from original tasks to similar novel tasks [3]. Thus deep neural networks have been explored for domain adaptation [18, 19, 15] and multimodal and multi-task learning [16, 20], where significant performance gains have been witnessed relative to prior shallow transfer learning methods.

However, recent advances show that deep networks can learn abstract feature representations that can only reduce, but not remove, the cross-domain discrepancy [18, 4]. Dataset shift has posed a bottleneck to the transferability of deep features, resulting in statistically unbounded risk for target tasks [21, 22]. Some recent work addresses the aforementioned problem by deep domain adaptation, which bridges the two worlds of deep learning and domain adaptation [4, 5, 6, 7]. They extend deep convolutional networks (CNNs) to domain adaptation either by adding one or multiple adaptation layers through which the mean embeddings of distributions are matched [4, 5], or by adding a fully connected subnetwork as a domain discriminator whilst the deep features are learned to confuse the domain discriminator in a domain-adversarial training paradigm [6, 7]. While performance was significantly improved, these state-of-the-art methods may be restricted by the assumption that, under the learned domain-invariant feature representations, the source classifier can be directly transferred to the target domain. In particular, this assumption may not hold when the source classifier and target classifier cannot be shared.
As theoretically studied in [22], when the combined error of the ideal joint hypothesis is large, there is no single classifier that performs well on both source and target domains, so we cannot find a good target classifier by directly transferring from the source domain.

This work is primarily motivated by He et al. [8], the winner of the ImageNet ILSVRC 2015 challenge. They present deep residual learning to ease the training of very deep networks (hundreds of layers), termed residual nets. The residual nets explicitly reformulate the layers as learning residual functions ΔF(x) with reference to the layer inputs x, instead of directly learning the unreferenced functions F(x) = ΔF(x) + x. The method focuses on standard deep learning in which training data and test data are drawn from identical distributions, hence it cannot be directly applied to domain adaptation. In this paper, we propose to bridge the source classifier f_S(x) and target classifier f_T(x) by the residual layers such that the classifier mismatch across domains can be explicitly modeled by the residual functions ΔF(x) in a deep learning architecture. Although the idea of adapting the source classifier to the target domain by adding a perturbation function has been studied by [23, 24, 25], these methods require target labeled data to learn the perturbation function, and hence cannot be applied to unsupervised domain adaptation, the focus of this study.
Another distinction is that their perturbation function is defined on the input features x, while the input to our residual function is the target classifier f_T(x), which can capture the connection between the source and target classifiers more effectively.

3 Residual Transfer Networks

In the unsupervised domain adaptation problem, we are given a source domain D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} of n_s labeled examples and a target domain D_t = {x_j^t}_{j=1}^{n_t} of n_t unlabeled examples. The source domain and target domain are sampled from different probability distributions p and q respectively, and p ≠ q. The goal of this paper is to design a deep neural network that enables learning of transfer classifiers y = f_s(x) and y = f_t(x) to close the source-target discrepancy, such that the expected target risk R_t(f_t) = E_{(x,y)~q}[f_t(x) ≠ y] can be bounded by leveraging the source domain supervised data.

The challenge of unsupervised domain adaptation arises in that the target domain has no labeled data, while the source classifier f_s trained on the source domain D_s cannot be directly applied to the target domain D_t due to the distribution discrepancy p(x, y) ≠ q(x, y). The distribution discrepancy may give rise to mismatches in both features and classifiers, i.e. p(x) ≠ q(x) and f_s(x) ≠ f_t(x). Both mismatches should be fixed by joint adaptation of features and classifiers to enable effective domain adaptation. Classifier adaptation is more difficult than feature adaptation because it is directly related to the labels but the target domain is fully unlabeled. Note that the state-of-the-art deep feature adaptation methods [5, 6, 7] generally assume classifiers can be shared on adapted deep features.
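As a toy numerical illustration of this setup (synthetic one-dimensional data, not from the paper): a classifier that is perfect on the source domain incurs nonzero empirical target risk once the input distribution shifts, which is exactly the mismatch the joint adaptation is meant to fix. The thresholds and shift below are made up.

```python
import numpy as np

# Synthetic illustration: the source domain p and the target domain q
# are shifted versions of each other (all quantities here are made up).
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 200)
ys = (xs > 0.0).astype(int)        # source labels: threshold at 0
xt = rng.normal(0.5, 1.0, 200)     # unlabeled target inputs, p != q
yt = (xt > 0.5).astype(int)        # hidden target labels, used only to measure risk

f_s = lambda x: (x > 0.0).astype(int)   # "source classifier" fit to D_s
src_err = np.mean(f_s(xs) != ys)        # empirical source error: zero by construction
tgt_risk = np.mean(f_s(xt) != yt)       # empirical R_t(f_s) = E_q[f_s(x) != y] > 0
```

Points falling between the two thresholds are exactly the ones the source classifier gets wrong on the target domain.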
This paper assumes f_s ≠ f_t and presents an end-to-end deep learning framework for classifier adaptation.

Deep networks [17] can learn distributed, compositional, and abstract representations for natural data such as image and text. This paper addresses unsupervised domain adaptation within deep networks for jointly learning transferable features and adaptive classifiers. We extend deep convolutional networks (CNNs), i.e. AlexNet [26], to novel residual transfer networks (RTNs) as shown in Figure 1. Denote by f_s(x) the source classifier; the empirical error of the CNN on source domain data D_s is

    min_{f_s} (1/n_s) Σ_{i=1}^{n_s} L(f_s(x_i^s), y_i^s),    (1)

where L(·,·) is the cross-entropy loss function. Based on the quantification study of feature transferability in deep convolutional networks [3], convolutional layers can learn generic features that are transferable across domains. Hence we opt to fine-tune, instead of directly adapt, the features of convolutional layers when transferring pre-trained deep models from the source domain to the target domain.

3.1 Feature Adaptation

Deep features learned by CNNs can disentangle explanatory factors of variations behind data distributions to boost knowledge transfer [19, 17]. However, the latest literature findings reveal that deep features can reduce, but not remove, the cross-domain distribution discrepancy [3], which motivates the state-of-the-art deep feature adaptation methods [5, 6, 7]. Deep features in standard CNNs must eventually transition from general to specific along the network, and the transferability of features and classifiers will decrease when the cross-domain discrepancy increases [3]. In other words, the shifts in the data distributions linger even after multilayer feature abstractions.
In this paper, we perform feature adaptation by matching the feature distributions of multiple layers ℓ ∈ L across domains. We reduce feature dimensions by adding a bottleneck layer fcb on top of the last feature layer of CNNs, and then fine-tune CNNs on source labeled examples such that the feature distributions of the source and target are made similar under the new feature representations in multiple layers L = {fcb, fcc}, as shown in Figure 1. To adapt multiple feature layers effectively, we propose the tensor product between features of multiple layers to perform lossless multi-layer feature fusion, i.e. z_i^s ≜ ⊗_{ℓ∈L} x_i^{sℓ} and z_j^t ≜ ⊗_{ℓ∈L} x_j^{tℓ}. We then perform feature adaptation by minimizing the Maximum Mean Discrepancy (MMD) [27] between the source and target domains using the fusion features (dubbed tensor MMD) as

    min_{f_s, f_t} D_L(D_s, D_t) = Σ_{i=1}^{n_s} Σ_{j=1}^{n_s} k(z_i^s, z_j^s) / n_s^2 + Σ_{i=1}^{n_t} Σ_{j=1}^{n_t} k(z_i^t, z_j^t) / n_t^2 − 2 Σ_{i=1}^{n_s} Σ_{j=1}^{n_t} k(z_i^s, z_j^t) / (n_s n_t),    (2)

where the characteristic kernel k(z, z′) = e^{−||vec(z) − vec(z′)||^2 / b} is the Gaussian kernel function defined on the vectorization of tensors z and z′ with bandwidth parameter b. Different from DAN [5], which adapts multiple feature layers using multiple MMD penalties, this paper adapts multiple feature layers by first fusing them and then adapting the fused features. The advantage of our method over DAN [5] is that it can capture the full interactions across multilayer features and facilitates easier model selection, while DAN [5] needs |L| independent MMD penalties for adapting |L| layers.

Figure 1: (left) Residual Transfer Network (RTN) for domain adaptation, based on well-established architectures. Due to dataset shift, (1) the last-layer features are tailored to domain-specific structures that are not safely transferable, hence we add a bottleneck layer fcb that is adapted jointly with the classifier layer fcc by the tensor MMD module; (2) supervised classifiers are not safely transferable, hence we bridge them by the residual layers fc1–fc2 so that f_S(x) = f_T(x) + Δf(x). (middle) The tensor MMD module for multi-layer feature adaptation. (right) The building block for deep residual learning; instead of using the residual block to model feature mappings, we use it to bridge the source classifier f_S(x) and target classifier f_T(x) with x ≜ f_T(x), F(x) ≜ f_S(x), and ΔF(x) ≜ Δf(x).

3.2 Classifier Adaptation

As feature adaptation cannot remove the mismatch in classification models, we further perform classifier adaptation to learn transfer classifiers that make domain adaptation more effective. Although the source classifier f_s(x) and target classifier f_t(x) are different, f_s(x) ≠ f_t(x), they should be related to ensure the feasibility of domain adaptation.
It is reasonable to assume that f_s(x) and f_t(x) differ only by a small perturbation function Δf(x). Prior work on classifier adaptation [23, 24, 25] assumes that f_t(x) = f_s(x) + Δf(x), where the perturbation Δf(x) is a function of the input feature x. However, these methods require target labeled data to learn the perturbation function, and hence cannot be applied to unsupervised domain adaptation, where the target domain has no labeled data. How to bridge f_s(x) and f_t(x) in a unified framework is a key challenge of unsupervised domain adaptation. We postulate that the perturbation function Δf(x) can be learned jointly from the source labeled data and target unlabeled data, given that the source classifier and target classifier are properly connected.

To enable classifier adaptation, consider fitting F(x) as an original mapping by a few stacked layers (convolutional or fully connected) in Figure 1 (right), where x denotes the inputs to the first of these layers [8]. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e. F(x) − x. Rather than expecting stacked layers to approximate F(x), one explicitly lets these layers approximate a residual function ΔF(x) ≜ F(x) − x, with the original function being ΔF(x) + x. The operation ΔF(x) + x is performed by a shortcut connection and an element-wise addition, while the residual function is parameterized by residual layers within each residual block. Although both forms are able to asymptotically approximate the desired functions, the ease of learning is different. In reality, it is unlikely that identity mappings are optimal, but it should be easier to find the perturbations with reference to an identity mapping than to learn the function anew.
Residual learning is the key to the successful training of very deep networks. The deep residual network (ResNet) framework [8] bridges the inputs and outputs of the residual layers by the shortcut connection (identity mapping) such that F(x) = ΔF(x) + x, which eases the learning of the residual function ΔF(x) (similar to the perturbation function across the source and target classifiers). Based on this key observation, we extend the CNN architectures (Figure 1, left) by plugging in the residual block (Figure 1, right). We reformulate the residual block to bridge the source classifier f_S(x) and target classifier f_T(x) by letting x ≜ f_T(x), F(x) ≜ f_S(x), and ΔF(x) ≜ Δf(x). Note that f_S(x) is the output of the element-wise addition operator and f_T(x) is the output of the target-classifier layer fcc, both before the softmax activation σ(·): f_s(x) ≜ σ(f_S(x)), f_t(x) ≜ σ(f_T(x)). We can connect the source classifier and target classifier (before activation) by the residual block as

    f_S(x) = f_T(x) + Δf(x),    (3)

where we use the functions f_S and f_T before softmax in the residual block to ensure that the final classifiers f_s and f_t output probabilities. The residual layers fc1–fc2 are fully-connected layers with c × c units, where c is the number of classes. We set the source classifier f_S as the output of the residual block to make it better trainable from the source-labeled data by deep residual learning [8]. In other words, if we set f_T as the output of the residual block, then we may be unable to learn it successfully, as we do not have target labeled data and thus standard back-propagation will not work.
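A minimal sketch of this classifier bridge, under made-up weight shapes and a toy input batch: f_T is the output of the target-classifier layer, two hypothetical residual layers (with an assumed ReLU in between) produce Δf from f_T, and f_S = f_T + Δf before softmax. This is an illustration, not the paper's Caffe implementation.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(0)
c = 5                                   # number of classes
feat = rng.normal(size=(2, 16))         # toy bottleneck features (assumption)
W_cc = rng.normal(size=(16, c)) * 0.1   # target-classifier layer fcc
W1 = rng.normal(size=(c, c)) * 0.1      # residual layer fc1 (c x c units)
W2 = rng.normal(size=(c, c)) * 0.1      # residual layer fc2 (c x c units)

f_T = feat @ W_cc                       # target logits f_T(x)
delta = np.maximum(f_T @ W1, 0) @ W2    # residual function Δf(x), input is f_T(x)
f_S = f_T + delta                       # Eq. (3): shortcut + element-wise addition
f_s, f_t = softmax(f_S), softmax(f_T)   # final classifiers output probabilities
```

Note the design choice the text motivates: the addition produces f_S, the quantity supervised by source labels, so gradients flow through both f_T and Δf.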
Deep residual learning [8] ensures that the outputs are valid classifiers with |Δf(x)| ≪ |f_T(x)| ≈ |f_S(x)|, and, more importantly, makes the perturbation function Δf(x) dependent on both the target classifier f_T(x) (due to the functional dependency) and the source classifier f_S(x) (due to the back-propagation pipeline).

Although we have cast classifier adaptation into the residual learning framework, and the residual learning framework tends to keep the target classifier f_t(x) from deviating much from the source classifier f_s(x), we still cannot guarantee that f_t(x) will fit the target-specific structures well. To address this problem, we further exploit the entropy minimization principle [28] for refining the classifier adaptation, which encourages low-density separation between classes by minimizing the entropy of the class-conditional distribution f_j^t(x_i^t) = p(y_i^t = j | x_i^t; f_t) on target domain data D_t as

    min_{f_t} (1/n_t) Σ_{i=1}^{n_t} H(f_t(x_i^t)),    (4)

where H(·) is the entropy function of the class-conditional distribution f_t(x_i^t), defined as H(f_t(x_i^t)) = −Σ_{j=1}^{c} f_j^t(x_i^t) log f_j^t(x_i^t), c is the number of classes, and f_j^t(x_i^t) is the probability of predicting point x_i^t as class j. By minimizing the entropy penalty (4), the target classifier f_t(x) is made directly accessible to the target unlabeled data and will amend itself to pass through the target low-density regions.

3.3 Residual Transfer Network

To enable effective unsupervised domain adaptation, we propose the Residual Transfer Network (RTN), which jointly learns transferable features and adaptive classifiers by integrating deep feature learning (1), feature adaptation (2), and classifier adaptation (3)–(4) in an end-to-end deep learning framework:

    min_{f_S = f_T + Δf} (1/n_s) Σ_{i=1}^{n_s} L(f_s(x_i^s), y_i^s) + (γ/n_t) Σ_{i=1}^{n_t} H(f_t(x_i^t)) + λ D_L(D_s, D_t),    (5)

where λ and γ are the tradeoff parameters for the tensor MMD penalty (2) and the entropy penalty (4), respectively. The proposed RTN model (5) is able to learn both transferable features and adaptive classifiers. As the classifier adaptation proposed in this paper and the feature adaptation studied in [5, 6] are tailored to adapt different layers of deep networks, they can complement each other to establish better performance. Since training deep CNNs requires a large amount of labeled data, which is prohibitive for many domain adaptation applications, we start with CNN models pre-trained on the ImageNet 2012 data and fine-tune them as in [5]. The training of RTN mainly follows standard back-propagation, with the residual transfer layers for classifier adaptation as in [8].
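The overall objective (5) can be sketched as a per-mini-batch loss computation. The probability arrays and the MMD value below are made-up placeholders (in the actual model they come from the network and the tensor MMD module), so this is an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def cross_entropy(probs, labels):
    # Eq. (1): mean cross-entropy of source predictions against source labels
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def entropy_penalty(probs):
    # Eq. (4): mean entropy H(f_t(x)) of target class-conditional distributions
    return -(probs * np.log(probs + 1e-12)).sum(-1).mean()

def rtn_objective(src_probs, src_labels, tgt_probs, mmd_value, lam=0.3, gamma=0.3):
    # Eq. (5): source cross-entropy + gamma * target entropy + lambda * tensor MMD
    return (cross_entropy(src_probs, src_labels)
            + gamma * entropy_penalty(tgt_probs)
            + lam * mmd_value)

# toy batch: confident source predictions, uncertain target predictions (made up)
src_probs = np.array([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]])
src_labels = np.array([0, 1])
tgt_probs = np.array([[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]])
loss = rtn_objective(src_probs, src_labels, tgt_probs, mmd_value=0.2)
```

Minimizing the entropy term drives the uncertain target rows toward confident, low-entropy predictions, which is the low-density separation effect described above.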
Note that the optimization of the tensor MMD penalty (2) requires a carefully designed algorithm to achieve linear-time training, as detailed in [5]. We also adopt bilinear pooling [29] to reduce the dimensions of the fusion features in the tensor MMD (2).

4 Experiments

We evaluate the Residual Transfer Network against state-of-the-art transfer learning and deep learning methods. Code and datasets will be available at https://github.com/thuml/transfer-caffe.

4.1 Setup

Office-31 [13] is a benchmark for domain adaptation, comprising 4,110 images in 31 classes collected from three distinct domains: Amazon (A), which contains images downloaded from amazon.com, and Webcam (W) and DSLR (D), which contain images taken by a web camera and a digital SLR camera with different photographic settings, respectively. To enable unbiased evaluation, we evaluate all methods on all six transfer tasks A → W, D → W, W → D, A → D, D → A and W → A, as in [5, 7].

Office-Caltech [14] is built by selecting the 10 common categories shared by Office-31 and Caltech-256 (C), and is widely used by previous methods [14, 30]. We can build 12 transfer tasks: A → W, D → W, W → D, A → D, D → A, W → A, A → C, W → C, D → C, C → A, C → W, and C → D. While Office-31 has more categories and is more difficult for domain adaptation algorithms, Office-Caltech provides more transfer tasks to enable an unbiased look at dataset bias [31]. We adopt DeCAF7 [2] features for shallow transfer methods and original images for deep adaptation methods.

We compare with both conventional and state-of-the-art transfer learning and deep learning methods: Transfer Component Analysis (TCA) [9], Geodesic Flow Kernel (GFK) [14], Deep Convolutional Neural Network (AlexNet [26]), Deep Domain Confusion (DDC) [4], Deep Adaptation Network (DAN) [5], and Reverse Gradient (RevGrad) [6].
TCA is a conventional transfer learning method based on MMD-regularized Kernel PCA. GFK is a manifold learning method that interpolates across an infinite number of intermediate subspaces to bridge domains. DDC is the first method that maximizes domain invariance by adding to AlexNet an adaptation layer using linear-kernel MMD [27]. DAN learns more transferable features by embedding deep features of multiple task-specific layers into reproducing kernel Hilbert spaces (RKHSs) and matching different distributions optimally using multi-kernel MMD. RevGrad improves domain adaptation by making the source and target domains indistinguishable to a discriminative domain classifier via an adversarial training paradigm.

To further examine the efficacy of classifier adaptation (residual transfer block) and feature adaptation (tensor MMD module), we perform an ablation study by evaluating several variants of RTN: (1) RTN (mmd), which adds the tensor MMD module to AlexNet; (2) RTN (mmd+ent), which further adds the entropy penalty to AlexNet; (3) RTN (mmd+ent+res), which further adds the residual module to AlexNet. Note that RTN (mmd) improves on DAN [5] by replacing the multiple MMD penalties in DAN with a single tensor MMD penalty, which facilitates much easier parameter selection.

We follow standard protocols and use all labeled source data and all unlabeled target data for domain adaptation [5]. We compare the average classification accuracy of each transfer task over three random experiments. For MMD-based methods (TCA, DDC, DAN, and RTN), we use a Gaussian kernel with bandwidth b set to the median of pairwise squared distances on the training data, i.e. the median heuristic [27].
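A compact sketch of the tensor fusion and tensor MMD of Eq. (2) together with the median-heuristic bandwidth just described, using synthetic feature matrices; this is a quadratic-time illustration, not the linear-time implementation of [5].

```python
import numpy as np

def fuse(layers):
    # tensor-product fusion of per-sample multi-layer features, vectorized:
    # (n, d1) and (n, d2) -> (n, d1*d2)
    z = layers[0]
    for x in layers[1:]:
        z = np.einsum('ni,nj->nij', z, x).reshape(len(z), -1)
    return z

def median_bandwidth(z):
    # median heuristic: b = median of pairwise squared distances
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    return np.median(d2[np.triu_indices(len(z), k=1)])

def tensor_mmd(src_layers, tgt_layers):
    # biased estimate of Eq. (2) with Gaussian kernel k(z,z') = exp(-||z-z'||^2 / b)
    zs, zt = fuse(src_layers), fuse(tgt_layers)
    b = median_bandwidth(np.concatenate([zs, zt]))
    k = lambda a, c: np.exp(-((a[:, None, :] - c[None, :, :]) ** 2).sum(-1) / b)
    return k(zs, zs).mean() + k(zt, zt).mean() - 2 * k(zs, zt).mean()

rng = np.random.default_rng(0)
src = [rng.normal(0.0, 1.0, (8, 4)), rng.normal(0.0, 1.0, (8, 3))]  # made-up fcb/fcc
tgt = [rng.normal(1.0, 1.0, (8, 4)), rng.normal(1.0, 1.0, (8, 3))]  # shifted target
```

The estimate is zero when the two samples coincide and grows with the shift between the fused feature distributions.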
As there are no target labeled data in unsupervised domain adaptation, model selection proves difficult. For all methods, we perform cross-validation on the labeled source data to select candidate parameters, then conduct validation on transfer task A → W by requiring one labeled example per category from the target domain W as the validation set, and fix the selected parameters throughout all transfer tasks.

We implement all deep methods based on the Caffe deep-learning framework, and fine-tune from the Caffe-provided AlexNet [26] models pre-trained on ImageNet. For RTN, we fine-tune all the feature layers and train the bottleneck layer fcb, the classifier layer fcc, and the residual layers fc1–fc2, all through standard back-propagation. Since these new layers are trained from scratch, we set their learning rate to be 10 times that of the other layers. We use mini-batch stochastic gradient descent (SGD) with momentum of 0.9 and the learning rate annealing strategy implemented in RevGrad [6]: the learning rate is not selected through a grid search due to high computational cost; instead it is adjusted during SGD using the formula η_p = η_0 / (1 + αp)^β, where p is the training progress changing linearly from 0 to 1, η_0 = 0.01, α = 10 and β = 0.75, which is optimized for low error on the source domain. As RTN works stably across different transfer tasks, the MMD penalty parameter λ and entropy penalty parameter γ are first selected on A → W and then fixed at λ = 0.3, γ = 0.3 for all other transfer tasks.

4.2 Results

The classification accuracy results on the six transfer tasks of Office-31 are shown in Table 1, and the results on the twelve transfer tasks of Office-Caltech are shown in Table 2. The RTN model based on AlexNet (Figure 1) outperforms all comparison methods on most transfer tasks.
In particular, RTN substantially improves the accuracy on hard transfer tasks, e.g. A → W and C → W, where the source and target domains are very different, and achieves comparable accuracy on easy transfer tasks, D → W and W → D, where the source and target domains are similar [13]. These results suggest that RTN is able to learn more adaptive classifiers and transferable features for safer domain adaptation.

From the results, we can make several interesting observations. (1) Standard deep-learning methods (AlexNet) perform comparably with traditional shallow transfer-learning methods that take deep DeCAF7 features as input (TCA and GFK). The only difference between these two sets of methods is that AlexNet can take advantage of supervised fine-tuning on the source-labeled data, while TCA and GFK can benefit from their domain adaptation procedures. This result confirms the current practice that supervised fine-tuning is important for transferring the source classifier to the target domain [19], and sustains the recent discovery that deep neural networks learn abstract feature representations, which can only reduce, but not remove, the cross-domain discrepancy [3].
Table 1: Accuracy on Office-31 dataset using the standard protocol [5] for unsupervised adaptation.

| Method | A→W | D→W | W→D | A→D | D→A | W→A | Avg |
|---|---|---|---|---|---|---|---|
| TCA [9] | 59.0±0.0 | 90.2±0.0 | 88.2±0.0 | 57.8±0.0 | 51.6±0.0 | 47.9±0.0 | 65.8 |
| GFK [14] | 58.4±0.0 | 93.6±0.0 | 91.0±0.0 | 58.6±0.0 | 52.4±0.0 | 46.1±0.0 | 66.7 |
| AlexNet [26] | 60.6±0.4 | 95.4±0.2 | 99.0±0.1 | 64.2±0.3 | 45.5±0.5 | 48.3±0.5 | 68.8 |
| DDC [4] | 61.0±0.5 | 95.0±0.3 | 98.5±0.3 | 64.9±0.4 | 47.2±0.5 | 49.4±0.6 | 69.3 |
| DAN [5] | 68.5±0.3 | 96.0±0.1 | 99.0±0.1 | 66.8±0.2 | 50.0±0.4 | 49.8±0.3 | 71.7 |
| RevGrad [6] | 73.0±0.6 | 96.4±0.4 | 99.2±0.3 | - | - | - | - |
| RTN (mmd) | 70.0±0.4 | 96.1±0.3 | 99.2±0.3 | 67.6±0.4 | 49.8±0.4 | 50.0±0.3 | 72.1 |
| RTN (mmd+ent) | 71.2±0.3 | 96.4±0.2 | 99.2±0.1 | 69.8±0.2 | 50.2±0.3 | 50.7±0.2 | 72.9 |
| RTN (mmd+ent+res) | 73.3±0.3 | 96.8±0.2 | 99.6±0.1 | 71.0±0.2 | 50.5±0.3 | 51.0±0.1 | 73.7 |

Table 2: Accuracy on Office-Caltech dataset using the standard protocol [5] for unsupervised adaptation.

| Method | A→W | D→W | W→D | A→D | D→A | W→A | A→C | W→C | D→C | C→A | C→W | C→D | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TCA [9] | 84.4 | 96.9 | 99.4 | 82.8 | 90.4 | 85.6 | 81.2 | 75.5 | 79.6 | 92.1 | 88.1 | 87.9 | 87.0 |
| GFK [14] | 89.5 | 97.0 | 98.1 | 86.0 | 89.8 | 88.5 | 76.2 | 77.1 | 77.9 | 90.7 | 78.0 | 77.1 | 85.5 |
| AlexNet [26] | 79.5 | 97.7 | 100.0 | 87.4 | 87.1 | 83.8 | 83.0 | 73.0 | 79.0 | 91.9 | 83.7 | 87.1 | 86.1 |
| DDC [4] | 83.1 | 98.1 | 100.0 | 88.4 | 89.0 | 84.9 | 83.5 | 73.4 | 79.2 | 91.9 | 85.4 | 88.8 | 87.1 |
| DAN [5] | 91.8 | 98.5 | 100.0 | 91.7 | 90.0 | 92.1 | 84.1 | 81.2 | 80.3 | 92.0 | 90.6 | 89.3 | 90.1 |
| RTN (mmd) | 93.2 | 98.5 | 100.0 | 91.7 | 88.0 | 90.7 | 84.0 | 81.3 | 80.4 | 91.0 | 90.4 | 90.0 | 89.8 |
| RTN (mmd+ent) | 93.8 | 98.6 | 100.0 | 92.9 | 93.6 | 92.7 | 87.8 | 84.8 | 83.4 | 93.2 | 96.6 | 93.9 | 92.6 |
| RTN (mmd+ent+res) | 95.2 | 99.2 | 100.0 | 95.5 | 93.8 | 92.5 | 88.1 | 86.6 | 84.6 | 93.7 | 96.9 | 94.2 | 93.4 |

This reveals that the two worlds of deep learning and domain adaptation cannot reinforce each other substantially in the two-step pipeline, which motivates carefully designed deep adaptation architectures to unify them. (2) Deep-transfer learning methods that reduce the domain discrepancy via domain-adaptive deep networks (DDC, DAN, and RevGrad) substantially outperform both standard deep-learning methods (AlexNet) and traditional shallow transfer-learning methods with deep features as input (TCA and GFK). This confirms that incorporating domain-adaptation modules into deep networks can improve domain-adaptation performance. By adapting the source and target distributions in multiple task-specific layers using optimal multi-kernel two-sample matching, DAN performs best overall among the prior deep-transfer learning methods. (3) The proposed residual transfer network (RTN) performs the best and sets a new state-of-the-art result on these benchmark datasets. Different from all previous deep-transfer learning methods, which only adapt the feature layers of deep neural networks to learn more transferable features, RTN further adapts the classifier layers to bridge the source and target classifiers in an end-to-end residual learning framework, which corrects the classifier mismatch more effectively.
To go deeper into the different modules of RTN, we report the results of RTN variants in Tables 1 and 2. (1) RTN (mmd) slightly outperforms DAN, yet RTN (mmd) has only one MMD penalty parameter while DAN has two or three. Thus the proposed tensor MMD module is effective for adapting multiple feature layers with a single MMD penalty, which is important for easy model selection. (2) RTN (mmd+ent) performs substantially better than RTN (mmd).
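For reference, the extra term that RTN (mmd+ent) adds is an entropy penalty on the target predictions [28]. The following minimal sketch is our own illustration, not the authors' implementation; the logit values are hypothetical:

```python
import numpy as np

# Minimal sketch of the target entropy penalty used by RTN (mmd+ent):
# the mean Shannon entropy of softmax predictions on unlabeled target
# data [28]. Values below are hypothetical, for illustration only.

def entropy_penalty(logits):
    """Mean entropy of softmax predictions; one row of logits per example."""
    z = logits - logits.max(axis=1, keepdims=True)       # numerically stable
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

confident = np.array([[10.0, 0.0, 0.0]])  # near one-hot prediction
uncertain = np.array([[1.0, 1.0, 1.0]])   # uniform prediction

# minimizing the penalty favors confident predictions on the target domain
assert entropy_penalty(confident) < entropy_penalty(uncertain)
```

Driving this penalty down pushes the decision boundaries away from dense regions of the target data, which is what the low-density-separation argument below refers to.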
This highlights the importance of entropy minimization for low-density separation, which exploits the cluster structure of the unlabeled target data so that the target classifier can be better adapted to the target data. (3) RTN (mmd+ent+res) performs the best across all variants. This highlights the importance of residual transfer of the classifier layers for learning more adaptive classifiers, which is critical because in practical applications there is no guarantee that the source classifier and target classifier can be safely shared. It is worth noting that the entropy penalty and the residual module should be used together; otherwise the residual function tends to learn a useless zero mapping, such that the source and target classifiers are nearly identical [8].

4.3 Discussion

Predictions Visualization: We visualize in Figures 2(a)–2(d) the t-SNE embeddings [2] of the predictions made by DAN and RTN on transfer task A → W. We can make the following observations. (1) The predictions made by DAN in Figures 2(a)–2(b) show that the target categories are not well discriminated by the source classifier, which implies that the target data is not well compatible with the source classifier. Hence the source and target classifiers should not be assumed to be identical, a common assumption made by all prior deep domain adaptation methods [4, 5, 6, 7]. (2) The predictions made by RTN in Figures 2(c)–2(d) show that the target categories are discriminated better by the target classifier (larger class-to-class distances), which suggests that residual transfer of classifiers is a reasonable extension to previous deep feature adaptation methods. RTN simultaneously learns more adaptive classifiers and more transferable features to enable effective domain adaptation.

Figure 2: Visualization by t-SNE: (a) DAN predictions, Source=A; (b) DAN predictions, Target=W; (c) RTN predictions, Source=A; (d) RTN predictions, Target=W.

Figure 3: (a) layer responses; (b) classifier shift; (c) sensitivity of γ (dashed lines show best baselines).

Layer Responses: We show in Figure 3(a) the means and standard deviations of the layer responses [8], which are the outputs of fT(x) (the fcc layer), Δf(x) (the fc2 layer), and fS(x) (after the element-wise sum operator), respectively. This exposes the response strength of the residual functions. The results show that the residual function Δf(x) generally has much smaller responses than the shortcut function fT(x). These results support our motivation that the residual functions are generally smaller than the non-residual functions, as they characterize the small gap between the source classifier and target classifier.
Such a small residual function can be learned effectively via deep residual learning [8].
Classifier Shift: To verify that a classifier shift exists between the source classifier fs and the target classifier ft, we train fs on the source domain and ft on the target domain, both with labeled data. Taking A as the source domain and W as the target domain, the weight parameters of the classifiers (e.g., softmax regression) are shown in Figure 3(b), which confirms that fs and ft are substantially different.
Parameter Sensitivity: We check the sensitivity of the entropy parameter γ on transfer tasks A → W (31 classes) and C → W (10 classes) by varying it in {0.01, 0.04, 0.07, 0.1, 0.4, 0.7, 1.0}. The results are shown in Figure 3(c), with the best results of the baselines shown as dashed lines. The accuracy of RTN first increases and then decreases as γ varies, tracing a desirable bell-shaped curve. This justifies our motivation of jointly learning transferable features and adaptive classifiers in the RTN model, as a good trade-off between them can promote transfer performance.

5 Conclusion

This paper presented a novel approach to unsupervised domain adaptation in deep networks that enables end-to-end learning of adaptive classifiers and transferable features. As in many prior domain adaptation techniques, feature adaptation is achieved by matching the distributions of features across domains. However, unlike previous work, the proposed approach also supports classifier adaptation, implemented through a new residual transfer module that bridges the source classifier and target classifier. This makes the approach a good complement to existing techniques. The approach can be trained by standard back-propagation, which is scalable, and can be implemented in most deep learning packages.
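For concreteness, the distribution-matching idea can be sketched with a toy MMD penalty [27]; a linear kernel is used below for brevity, whereas RTN uses a multi-kernel tensor MMD over fused layer features, so everything here is illustrative rather than the authors' implementation:

```python
import numpy as np

# Toy sketch of an MMD penalty for feature adaptation [27]. With a
# linear kernel, squared MMD reduces to the squared distance between
# the mean feature embeddings of the two domains. Illustrative only.

def mmd2_linear(source, target):
    """Squared MMD between two feature batches under a linear kernel."""
    diff = source.mean(axis=0) - target.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(0)
src = rng.normal(loc=0.0, size=(200, 16))  # hypothetical source features
tgt = rng.normal(loc=1.0, size=(200, 16))  # shifted target features

assert mmd2_linear(src, src) == 0.0   # identical batches incur no penalty
assert mmd2_linear(src, tgt) > 1.0    # a domain shift incurs a large one
```

Adding such a penalty to the classification loss, and back-propagating through both, is what makes the feature adaptation trainable end to end.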
Future work includes extensions to semi-supervised domain adaptation.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61502265, 61325008), National Key R&D Program of China (2016YFB1000701, 2015BAF32B01), and TNList Key Project.

References
[1] S. J. Pan and Q. Yang. A survey on transfer learning. TKDE, 22(10):1345–1359, 2010.
[2] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[3] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[4] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
[5] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
[6] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
[7] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[9] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. TNNLS, 22(2):199–210, 2011.
[10] L. Duan, I. W. Tsang, and D. Xu. Domain transfer multiple kernel learning. TPAMI, 34(3):465–479, 2012.
[11] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML, 2013.
[12] X. Wang and J. Schneider. Flexible transfer learning under support and model shift. In NIPS, 2014.
[13] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, 2010.
[14] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[15] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In NIPS, 2014.
[16] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.
[17] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013.
[18] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.
[19] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
[20] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In ICML, 2011.
[21] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.
[22] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. MLJ, 79(1-2):151–175, 2010.
[23] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In MM, pages 188–197. ACM, 2007.
[24] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, pages 289–296. ACM, 2009.
[25] L. Duan, D. Xu, I. W. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. TPAMI, 34(9):1667–1680, 2012.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[27] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.
[28] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.
[29] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In ICCV, pages 1449–1457, 2015.
[30] B. Sun, J. Feng, and K. Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
[31] A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, 2011.