{"title": "Adversarial Multiple Source Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 8559, "page_last": 8570, "abstract": "While domain adaptation has been actively researched, most algorithms focus on the single-source-single-target adaptation setting. In this paper we propose new generalization bounds and algorithms under both classification and regression settings for unsupervised multiple source domain adaptation. Our theoretical analysis naturally leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose multisource domain adversarial networks (MDAN) that approach domain adaptation by optimizing task-adaptive generalization bounds. To demonstrate the effectiveness of MDAN, we conduct extensive experiments showing superior adaptation performance on both classification and regression problems: sentiment analysis, digit classification, and vehicle counting.", "full_text": "Adversarial Multiple Source Domain Adaptation

Han Zhao†∗, Shanghang Zhang†‡∗, Guanhang Wu†, João P. Costeira‡, José M. F. Moura†, Geoffrey J. Gordon†
†Carnegie Mellon University
‡IST, Universidade de Lisboa
{hzhao1,shanghaz,guanhanw,moura,ggordon}@andrew.cmu.edu, jpc@isr.ist.utl.pt

Abstract

While domain adaptation has been actively researched, most algorithms focus on the single-source-single-target adaptation setting. In this paper we propose new generalization bounds and algorithms under both classification and regression settings for unsupervised multiple source domain adaptation.
Our theoretical analysis naturally leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose multisource domain adversarial networks (MDAN) that approach domain adaptation by optimizing task-adaptive generalization bounds. To demonstrate the effectiveness of MDAN, we conduct extensive experiments showing superior adaptation performance on both classification and regression problems: sentiment analysis, digit classification, and vehicle counting.

1 Introduction

The success of machine learning has been partially attributed to rich datasets with abundant annotations [40]. Unfortunately, collecting and annotating such large-scale training data is prohibitively expensive and time-consuming. To overcome these limitations, different labeled datasets can be combined to build a larger one, or synthetic training data can be generated with explicit yet inexpensive annotations [41]. However, due to the possible shift between training and test samples, learning algorithms based on these cheaper datasets still suffer from high generalization error. Domain adaptation (DA) addresses such problems by establishing knowledge transfer from a labeled source domain to an unlabeled target domain, and by exploring domain-invariant structures and representations to bridge the gap [38]. Both theoretical results [8, 22, 32, 33, 49] and algorithms [1, 2, 6, 19, 20, 23, 25, 26, 30, 39] for DA have been proposed. Most theoretical results and algorithms for DA focus on the single-source-single-target setting [17, 31, 42, 45, 46]. However, in many application scenarios, the labeled data available may come from multiple domains with different distributions.
As a result, naive application of single-source-single-target DA algorithms may lead to suboptimal solutions. Such problems call for efficient techniques for multiple source domain adaptation. Some existing multisource DA methods [16, 23, 24, 43, 51] do not translate into effective deep learning based algorithms, leaving considerable room for performance improvement.

In this paper, we analyze the multiple source domain adaptation problem and propose an adversarial learning strategy based on our theoretical results. Specifically, we give new generalization bounds for both classification and regression problems under domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances. Our theoretical results build on the seminal theoretical model for domain adaptation introduced by Blitzer et al. [9] and Ben-David et al. [8], where a divergence measure, known as the H-divergence, was proposed to measure the distance between two distributions based on a given hypothesis space H. Our new result generalizes the bound [8, Thm. 2] to the case of multiple source domains, and to regression problems as well. The new bounds achieve a finite sample error rate of Õ(√(1/km)), where k is the number of source domains and m is the number of labeled training instances from each domain. We provide detailed comparisons with existing work in Section 3.

∗The first two authors contributed equally to this work.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Interestingly, our bounds also lead to an efficient algorithm using adversarial neural networks. This algorithm learns both domain invariant and task discriminative features under multiple domains.
Specifically, we propose a novel MDAN model by using neural networks as rich function approximators to instantiate the generalization bound we derive (Fig. 1). MDAN can be viewed as a computationally efficient approximation that optimizes the parameters of the networks in order to minimize the bounds. We introduce two versions of MDAN: the hard version directly optimizes a simple worst-case generalization bound, while the soft version leads to a more data-efficient model and optimizes an average-case, task-adaptive bound. The optimization of MDAN is a minimax saddle point problem, which can be interpreted as a zero-sum game with two participants competing against each other to learn invariant features. MDAN combines feature extraction, domain classification, and task learning in one training process. We use stochastic optimization with simultaneous updates to optimize the parameters in each iteration.

Contributions. Our contributions are three-fold: 1) Theoretically, we provide average-case generalization bounds for both classification and regression problems under the multisource domain adaptation setting. 2) Inspired by our theoretical results, we propose efficient algorithms that tackle multisource domain adaptation problems using an adversarial learning strategy. 3) Empirically, to demonstrate the effectiveness of MDAN as well as the relevance of our theoretical results, we conduct extensive experiments on real-world datasets, covering both natural language and vision tasks, and both classification and regression problems. We achieve consistently superior adaptation performance on all the tasks, validating the effectiveness of our models.

2 Preliminary

We first introduce the notation and review a theoretical model for domain adaptation when there is one source and one target domain [7-9, 27]. The key idea is the H-divergence, which measures the discrepancy between two distributions.
Other theoretical models for DA exist [12, 13, 33, 35]; we choose to work with the above model because this distance measure has a particularly natural interpretation and can be well approximated using samples from both domains.

Notation. We use domain to represent a distribution D on an input space X together with a labeling function f : X → [0, 1]. In the setting of one source and one target domain, we use ⟨D_S, f_S⟩ and ⟨D_T, f_T⟩ to denote the source and target, respectively. A hypothesis is a function h : X → [0, 1]. The error of a hypothesis h w.r.t. a labeling function f under distribution D_S is defined as ε_S(h, f) := E_{x∼D_S}[|h(x) − f(x)|]. When f and h are binary classification functions, this definition reduces to the probability that h disagrees with f under D_S: E_{x∼D_S}[|h(x) − f(x)|] = E_{x∼D_S}[1(f(x) ≠ h(x))] = Pr_{x∼D_S}(f(x) ≠ h(x)).

We define the risk of hypothesis h as the error of h w.r.t. the true labeling function under domain D_S, i.e., ε_S(h) := ε_S(h, f_S). As is common in computational learning theory, we use ε̂_S(h) to denote the empirical risk of h on the source domain. Similarly, we use ε_T(h) and ε̂_T(h) for the true risk and the empirical risk on the target domain. The H-divergence is defined as follows:

Definition 1. Let H be a hypothesis class on instance space X, and let A_H be the collection of subsets of X that are the support of some hypothesis in H, i.e., A_H := {h⁻¹({1}) | h ∈ H}. The distance between two distributions D and D′ based on H is: d_H(D, D′) := 2 sup_{A ∈ A_H} |Pr_D(A) − Pr_{D′}(A)|.

When the hypothesis class H contains all measurable functions over X, d_H(D, D′) reduces to the familiar total variation distance. Given a hypothesis class H, we define its symmetric difference w.r.t. itself as: HΔH := {h(x) ⊕ h′(x) | h, h′ ∈ H}, where ⊕ is the XOR operation.
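To make Definition 1 concrete, here is a small numpy sketch (our illustration, not from the paper) that computes the empirical H-divergence between two one-dimensional samples for the finite hypothesis class of threshold classifiers h_t(x) = 1[x > t]; the function name and the sampled distributions are assumptions for the example.

```python
import numpy as np

def empirical_h_divergence(sample_a, sample_b, thresholds):
    """Empirical H-divergence between two 1-D samples for the finite class
    H = {h_t(x) = 1[x > t] : t in thresholds}. For each hypothesis, the
    support h^{-1}({1}) is the interval (t, inf); we estimate its probability
    on each sample and take 2 * sup of the absolute difference."""
    best = 0.0
    for t in thresholds:
        p_a = np.mean(sample_a > t)  # Pr_D(A) estimated on sample_a
        p_b = np.mean(sample_b > t)  # Pr_D'(A) estimated on sample_b
        best = max(best, abs(p_a - p_b))
    return 2.0 * best

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, 5000)   # a "source" sample
xt = rng.normal(1.0, 1.0, 5000)   # a mean-shifted "target" sample
d = empirical_h_divergence(xs, xt, thresholds=np.linspace(-3, 4, 141))
# identical samples give 0; disjoint supports give the maximum value 2
```

As the sketch shows, the divergence is driven by the single hypothesis that best separates the two samples, which is why a trained domain discriminator can serve as a practical estimator of this quantity.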
Let h∗ be the optimal hypothesis that achieves the minimum combined risk on the source and the target domains, h∗ := argmin_{h∈H} ε_S(h) + ε_T(h), and let λ denote the combined risk of the optimal hypothesis h∗: λ := ε_S(h∗) + ε_T(h∗). Ben-David et al. [7] and Blitzer et al. [9] proved the following generalization bound on the target risk in terms of the source risk and the discrepancy between the single source domain and the target domain:

Theorem 1 ([9]). Let H be a hypothesis space of VC-dimension d and let D̂_S (D̂_T) be the empirical distribution induced by a sample of size m drawn from D_S (D_T). Then with probability at least 1 − δ, for all h ∈ H,

    ε_T(h) ≤ ε̂_S(h) + (1/2) d_{HΔH}(D̂_S, D̂_T) + λ + O(√((d log(m/d) + log(1/δ)) / m)).   (1)

The bound depends on λ, the optimal combined risk achievable by hypotheses in H. The intuition is that if λ is large, we cannot hope for successful domain adaptation. One notable feature is that the empirical discrepancy distance between two samples can be approximated by a discriminator trained to distinguish instances from the two domains.

3 Generalization Bound for Multiple Source Domain Adaptation

In this section we discuss two approaches to obtain generalization guarantees for multiple source domain adaptation in both classification and regression settings: one by a union bound argument, and one using a reduction from multiple source domains to a single source domain. We conclude this section with a discussion and comparison of our bounds with existing generalization bounds for multisource domain adaptation [8, 35]. We refer readers to the appendix for proof details and mainly focus on the interpretations and implications of the theorems.

Let {D_{S_i}}_{i=1}^k and D_T be the k source domains and the target domain, respectively. One idea to obtain a generalization bound for multiple source domains is to apply Thm. 1 repeatedly k times, followed by a union bound to combine them. Following this idea, we first obtain the following bound as a corollary of Thm. 1 in the setting of multiple source domains, serving as a baseline model:

Corollary 1 (Worst case classification bound). Let H be a hypothesis class with VCdim(H) = d. If D̂_T and {D̂_{S_i}}_{i=1}^k are the empirical distributions generated with m i.i.d. samples from each domain, then, for 0 < δ < 1, with probability at least 1 − δ, for all h ∈ H, we have:

    ε_T(h) ≤ max_{i∈[k]} { ε̂_{S_i}(h) + (1/2) d_{HΔH}(D̂_T; D̂_{S_i}) + λ_i } + O(√((1/m)(log(k/δ) + d log(m/d)))),   (2)

where λ_i is the combined risk of the optimal hypothesis on domains S_i and T.

This bound is quite pessimistic: it is essentially a worst case bound, where the generalization on the target only depends on the worst source domain. However, in many real-world scenarios, when the number of related source domains is large, a single irrelevant source domain may not hurt the generalization too much. Furthermore, in the case of multiple source domains, despite the possible discrepancy between the source domains and the target domain, effectively we have a labeled sample of size km, while the asymptotic convergence rate in Corollary 1 is Õ(√(1/m)). A natural question to ask is therefore: is it possible to have a generalization bound with finite sample rate Õ(√(1/km))? In what follows we present a strategy that achieves such a rate. The idea is a reduction from multiple domains to a single domain via a convex combination that merges the labeled instances from all k domains into one.

Theorem 2 (Average case classification bound). Let H be a hypothesis class with VCdim(H) = d. If {D̂_{S_i}}_{i=1}^k are the empirical distributions generated with m i.i.d. samples from each domain, and D̂_T is the empirical distribution on the target domain generated from mk samples without labels, then, for all α ∈ R^k_+ with Σ_{i∈[k]} α_i = 1, and for 0 < δ < 1, with probability at least 1 − δ, for all h ∈ H, we have:

    ε_T(h) ≤ Σ_{i∈[k]} α_i ( ε̂_{S_i}(h) + (1/2) d_{HΔH}(D̂_T; D̂_{S_i}) ) + λ_α + O(√((1/km)(log(1/δ) + d log(km/d)))),   (3)

where λ_α is the risk of the optimal hypothesis on the mixture source domain Σ_{i∈[k]} α_i S_i and T.

Different from Corollary 1, Thm. 2 requires mk unlabeled instances from the target domain. This is a mild requirement, since unlabeled data is cheap to collect. Roughly, the bound in Thm. 2 can be understood as an average case bound if we choose α_i = 1/k for all i ∈ [k]. Note that a simple convex combination obtained by applying Thm. 1 k times can only achieve a finite sample rate of Õ(√(1/m)), while the one in (3) achieves Õ(√(1/km)). On the other hand, the constants max_{i∈[k]} λ_i (in Corollary 1) and λ_α (in Thm. 2) are in general not comparable. As a final note, although the proof works for any convex combination α, in the next section we describe a practical method so that we do not need to choose it explicitly. Thm. 2 upper bounds the generalization error for classification problems. Next we provide a generalization guarantee for regression, where instead of the VC dimension we use the pseudo-dimension to characterize the structural complexity of the hypothesis class.

Theorem 3 (Average case regression bound). Let H be a set of real-valued functions from X to [0, 1] (see Footnote 2) with Pdim(H) = d. If {D̂_{S_i}}_{i=1}^k are the empirical distributions generated with m i.i.d. samples from each domain, and D̂_T is the empirical distribution on the target domain generated from mk samples without labels, then, for all α ∈ R^k_+ with Σ_{i∈[k]} α_i = 1, and for 0 < δ < 1, with probability at least 1 − δ, for all h ∈ H, we have:

    ε_T(h) ≤ Σ_{i∈[k]} α_i ( ε̂_{S_i}(h) + (1/2) d_{H̄}(D̂_T; D̂_{S_i}) ) + λ_α + O(√((1/km)(log(1/δ) + d log(km/d)))),   (4)

where λ_α is the risk of the optimal hypothesis on the mixture source domain Σ_{i∈[k]} α_i S_i and T, and H̄ := {1_{|h(x)−h′(x)|>t} : h, h′ ∈ H, 0 ≤ t ≤ 1} is the set of threshold functions induced from H.

Comparison with Existing Bounds. First, it is easy to see that the bounds in both (2) and (3) reduce to the one in Thm. 1 when there is only one source domain (k = 1). Blitzer et al. [9] give a generalization bound for semi-supervised classification with multiple sources where, besides labeled instances from multiple source domains, the algorithm also has access to a fraction of labeled instances from the target domain. Although in general our bound and the one in [9, Thm. 3] are incomparable, it is instructive to see the connections and differences between them: our bound works in the unsupervised domain adaptation setting, where we do not have any labeled data from the target, whereas the bound in [9, Thm. 3] is for semi-supervised domain adaptation. As a result, because of the access to labeled instances from the target domain, their bound is expressed relative to the optimal error on the target, while ours is in terms of the empirical error on the source domains; hence theirs is more informative. To the best of our knowledge, our bound in Thm. 3 is the first one using the idea of the H-divergence for regression problems. The proof of this theorem relies on a reduction from regression to classification. Mansour et al. [34] give a generalization bound for multisource domain adaptation under the assumption that the target distribution is a mixture of the k sources and the target hypothesis can be represented as a convex combination of the source hypotheses.
Also, their generalized discrepancy measure can be applied to other loss functions.

4 Multisource Domain Adaptation with Adversarial Neural Networks

Motivated by the bounds given in the last section, we now propose our model, multisource domain adversarial networks (MDAN), in two versions: a hard version (as a baseline) and a soft version. Suppose we are given samples drawn from k source domains {D_{S_i}}, each of which contains m instance-label pairs. Additionally, we also have access to unlabeled instances sampled from the target domain D_T. Once we fix our hypothesis class H, the last two terms in the generalization bounds (2) and (3) are fixed; hence we can only hope to minimize the bounds by minimizing the first two terms, i.e., the source training error and the discrepancy between the source domains and the target domain. The idea is to train a neural network to learn a representation with the following two properties: 1) it is indistinguishable between the k source domains and the target domain; 2) it is informative enough for our desired task to succeed. Both requirements are necessary: without the second property, a neural network could learn trivial random noise representations for all the domains, and such representations cannot be distinguished by any discriminator; without the first property, the learned representation does not necessarily generalize to the unseen target domain.

One key observation from Ben-David et al. [7] that leads to a practical approximation of d_{HΔH}(D̂_T; D̂_{S_i}) is that computing the discrepancy measure is closely related to learning a classifier that is able to distinguish samples from different domains. Let ε̂_{T,S_i}(h) be the empirical risk of hypothesis

[Footnote 2] This is just for simplicity of presentation; the range can easily be generalized to any bounded set.

Figure 1: MDAN network architecture.
Feature extractor, domain classifier, and task learning are combined in one training process. Hard version: the source that achieves the minimum domain classification error is backpropagated with gradient reversal. Soft version: all the domain classification risks over the k source domains are combined and backpropagated adaptively with gradient reversal.

h in the domain discriminating task. Ignoring the constant terms that do not affect the upper bound, we can minimize the worst case upper bound in (2) by solving the following optimization problem:

    Hard version:  minimize  max_{i∈[k]} ( ε̂_{S_i}(h) − min_{h′∈HΔH} ε̂_{T,S_i}(h′) ).   (5)

The two terms in (5) exactly correspond to the two criteria we just proposed: the first term asks for an informative feature representation for our desired task to succeed, while the second term captures the notion of invariant feature representations between different domains. Inspired by Ganin et al. [17], we use a gradient reversal layer to effectively implement (5) by backpropagation. The network architecture is shown in Figure 1. As discussed in the last section, one notable drawback of the hard version is that the algorithm may spend too much computational resource optimizing the worst source domain. Furthermore, in each iteration the algorithm only updates its parameters based on the gradient from one of the k domains. This is data-inefficient and wastes computation in the forward pass.

To avoid both problems, we propose the MDAN soft version, which optimizes an upper bound of the convex combination bound given in (3). To this end, define ε̂_i(h) := ε̂_{S_i}(h) − min_{h′∈HΔH} ε̂_{T,S_i}(h′) and let γ > 0 be a constant. We formulate the following optimization problem:

    Soft version:  minimize  (1/γ) log Σ_{i∈[k]} exp( γ ( ε̂_{S_i}(h) − min_{h′∈HΔH} ε̂_{T,S_i}(h′) ) ).   (6)

At first glance, it may not be clear what this objective function corresponds to. To understand it, if we define α_i := exp(γ ε̂_i(h)) / Σ_{j∈[k]} exp(γ ε̂_j(h)), then the following chain of inequalities holds:

    Σ_{i∈[k]} α_i ε̂_i(h) ≤ (1/γ) log E_α[exp(γ ε̂_i(h))] = (1/γ) log ( Σ_{i∈[k]} exp(2γ ε̂_i(h)) / Σ_{i∈[k]} exp(γ ε̂_i(h)) ) ≤ (1/γ) log Σ_{i∈[k]} exp(γ ε̂_i(h)).

In other words, the objective function in (6) is in fact an upper bound of the convex combination bound given in (3), with the combination weights α defined above. Compared with (3), one advantage of the objective in (6) is that we do not need to choose the value of α explicitly. Instead, α adapts to the losses ε̂_i(h): the larger the loss, the heavier the weight.

Alternatively, from the algorithmic perspective, during optimization (6) naturally provides an adaptive weighting scheme for the k source domains depending on their relative errors. Using θ to denote all the model parameters:

    ∂/∂θ [ (1/γ) log Σ_{i∈[k]} exp( γ ( ε̂_{S_i}(h) − min_{h′∈HΔH} ε̂_{T,S_i}(h′) ) ) ] = Σ_{i∈[k]} ( exp(γ ε̂_i(h)) / Σ_{i′∈[k]} exp(γ ε̂_{i′}(h)) ) · ∂ε̂_i(h)/∂θ.   (7)

Compared with (5), the log-sum-exp trick not only smooths the objective, but also provides a principled and adaptive way to combine the gradients from all k source domains. In words, (7) says that the gradient of MDAN is a convex combination of the gradients from all the domains: the larger the error on one domain, the larger its combination weight in the ensemble. As we will see in Sec. 5, the optimization problem (6) often leads to better generalization in practice, which may partly be explained by the ensemble effect of multiple sources implied by the upper bound.

5 Experiments

We evaluate both hard and soft MDAN and compare them with state-of-the-art methods on three real-world datasets: the Amazon benchmark dataset [11] for sentiment analysis; a digit classification task that includes 4 datasets: MNIST [29], MNIST-M [17], SVHN [37], and SynthDigits [17]; and a public, large-scale image dataset for vehicle counting from multiple city cameras [52]. Due to space limits, details of the network architectures and training parameters of the proposed and baseline methods, as well as detailed dataset descriptions, are given in the appendix.

5.1 Amazon Reviews

Domains within the dataset consist of reviews on a specific kind of product (Books, DVDs, Electronics, and Kitchen appliances). Reviews are encoded as 5000-dimensional feature vectors of unigrams and bigrams, with binary labels indicating sentiment. We conduct 4 experiments: in each, we pick one product as the target domain and the rest as source domains. Each source domain has 2000 labeled examples, and the target test set has 3000 to 6000 examples. During training, we randomly sample the same number of unlabeled target examples as source examples in each mini-batch. We implement both the Hard-Max and Soft-Max methods, and compare them with three baselines: MLPNet, marginalized stacked denoising autoencoders (mSDA) [11], and DANN [17]. DANN cannot be directly applied in the multiple source domains setting. In order to make a comparison, we use two protocols. The first is to combine all the source domains into a single one and train on it using DANN, which we denote as C-DANN. The second protocol is to train multiple DANNs separately, where each one corresponds to a source-target pair.
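To make the adaptive weighting behind the soft version of Section 4 (Eqs. 6-7) concrete, the following numpy sketch (our illustration; γ and the per-domain losses are hypothetical numbers, and the actual MDAN computes these losses with neural networks and gradient reversal) evaluates the log-sum-exp objective and its implied softmax weights:

```python
import numpy as np

def softmax_weights(losses, gamma):
    # alpha_i proportional to exp(gamma * loss_i): the worst domain
    # receives the heaviest weight, as in Eq. 7
    z = np.exp(gamma * (losses - losses.max()))  # shift for numerical stability
    return z / z.sum()

def soft_objective(losses, gamma):
    # (1/gamma) * log sum_i exp(gamma * loss_i), the smooth surrogate of Eq. 6;
    # it upper-bounds max_i loss_i and approaches it as gamma grows
    m = losses.max()
    return m + np.log(np.sum(np.exp(gamma * (losses - m)))) / gamma

losses = np.array([0.30, 0.55, 0.40])    # hypothetical per-source-domain losses
w = softmax_weights(losses, gamma=10.0)  # largest weight on the worst loss, 0.55
obj = soft_objective(losses, gamma=10.0)
# max(losses) <= obj <= max(losses) + log(k)/gamma, and the gradient of obj is
# the convex combination sum_i w_i * d(loss_i)/d(theta), matching Eq. 7
```

The stability shift by the maximum does not change the weights or the objective; it only avoids overflow in the exponentials for large γ.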
Among all the DANNs, we report the one achieving the best performance on the target domain. We denote this experiment as B-DANN. For fair comparison, all these models are built on the same basic network structure, with one input layer (5000 units) and three hidden layers (1000, 500, 100 units).

Table 1: Sentiment classification accuracy.

Train/Test   MLPNet   mSDA     B-DANN   C-DANN   MDAN Hard-Max   MDAN Soft-Max
D+E+K/B      0.7655   0.7698   0.7650   0.7789   0.7845          0.7863
B+E+K/D      0.7588   0.7861   0.7732   0.7886   0.7797          0.8065
B+D+K/E      0.8460   0.8198   0.8381   0.8491   0.8483          0.8534
B+D+E/K      0.8545   0.8426   0.8433   0.8639   0.8580          0.8626

Results and Analysis. We show the accuracy of the different methods in Table 1. Clearly, Soft-Max significantly outperforms all other methods in most settings. When Kitchen is the target domain, C-DANN performs slightly better than Soft-Max, and all the methods perform close to each other. Hard-Max is typically slightly worse than Soft-Max, mainly due to the lower data-efficiency of the Hard-Max model (Section 4, Eq. 5, Eq. 6). We observe that with more training iterations, the performance of Hard-Max can be further improved. These results verify the effectiveness of MDAN for multisource domain adaptation. To validate the statistical significance of the results, we also run a non-parametric Wilcoxon signed-rank test for each task to compare Soft-Max with the other competitors (see more details in the appendix). The test confirms that Soft-Max is convincingly better than the other methods.

5.2 Digits Datasets

Following the setting in [17], we combine four digit datasets (MNIST, MNIST-M, SVHN, SynthDigits) to build the multisource domain dataset. We take each of MNIST-M, SVHN, and MNIST as the target domain in turn, and the rest as sources. Each source domain has 20,000 labeled images and the target test set has 9,000 examples.

Baselines.
We compare Hard-Max and Soft-Max of MDAN with 10 baselines: i) B-Source: a basic network trained on each source domain (20,000 images) without domain adaptation and tested on the target domain; among the three models, we report the one that achieves the best performance on the test set. ii) C-Source: a basic network trained on a combination of the three source domains (20,000 images each) without domain adaptation and tested on the target domain. iii) B-DANN: we train DANNs [17] on each source-target domain pair (20,000 images for each source) and test on the target; again, we report the best score among the three.

Table 2: Accuracy on digit classification. T: MNIST; M: MNIST-M; S: SVHN; D: SynthDigits.

Method     S+M+D/T   T+S+D/M   M+T+D/S      Method     S+M+D/T   T+S+D/M   M+T+D/S
B-Source   0.964     0.519     0.814        C-Source   0.938     0.561     0.771
B-DANN     0.967     0.591     0.818        C-DANN     0.925     0.651     0.776
B-ADDA     0.968     0.657     0.800        C-ADDA     0.927     0.682     0.804
B-MTAE     0.862     0.534     0.703        C-MTAE     0.821     0.596     0.701
Hard-Max   0.976     0.663     0.802        Soft-Max   0.979     0.687     0.816
MDAC       0.755     0.563     0.604        Target     0.987     0.901     0.898

iv) C-DANN: we train a single DANN on a combination of the three source domains (20,000 images each). v) B-ADDA: we train ADDA [46] on each source-target domain pair (20,000 images for each source) and test it on the target domain; we report the best accuracy among the three. vi) C-ADDA: we train ADDA on a combination of the three source domains (20,000 images each). vii) B-MTAE: we train MTAE [19] on each source-target domain pair (20,000 images for each source) and test it on the target domain; we report the best accuracy among the three. viii) C-MTAE: we train MTAE on a combination of the three source domains (20,000 images each). ix) MDAC.
MDAC [51] is a multiple source domain adaptation algorithm that explores causal models to represent the relationship between the features X and the class label Y; we train MDAC directly on a combination of the three source domains. x) Target: the basic network trained and tested on the target data; it serves as an upper bound for DA algorithms. All the MDAN and baseline methods are built on the same basic network structure to put them on an equal footing.

Results and Analysis. The classification accuracy is shown in Table 2. MDAN outperforms all the baselines in the first two experiments and is comparable with B-DANN in the third. For combined sources, MDAN always performs better than the source-only baseline (MDAN vs. C-Source). However, a naive combination of different training datasets can sometimes even decrease the performance of the baseline methods. This conclusion comes from three observations: first, directly training DANN on a combination of multiple sources can lead to worse results than the source-only baseline (C-DANN vs. C-Source); second, the performance of C-DANN can be even worse than that of B-DANN (the first and third experiments); third, directly training DANN on a combination of multiple sources always has lower accuracy than our approach (C-DANN vs. MDAN). We have similar observations for ADDA and MTAE. These observations verify that domain adaptation methods designed for a single source lead to suboptimal solutions when applied to multiple sources, and they confirm the necessity and superiority of MDAN for multiple source adaptation. Furthermore, we observe that adaptation to the SVHN dataset (the third experiment) is hard. In this case, increasing the number of source domains does not help. We conjecture this is due to the large dissimilarity between the SVHN data and the others.
Surprisingly, using a single source domain (B-DANN) achieves the best result in this case. This indicates that in domain adaptation the quality of the data (how close it is to the target data) matters much more than the quantity (how many source domains are available). In conclusion, this experiment further demonstrates the effectiveness of MDAN when multiple source domains are available, where a naive combination of multiple sources using DANN may hurt generalization.

5.3 WebCamT Vehicle Counting Dataset

WebCamT is a public dataset for vehicle counting from large-scale city camera videos, which have low resolution (352 × 240), low frame rate (1 frame/second), and high occlusion. It has 60,000 frames annotated with vehicle bounding boxes and counts, divided into training and testing sets with 42,200 and 17,800 frames, respectively. Here we demonstrate the effectiveness of MDAN for counting vehicles in an unlabeled target camera by adapting from multiple labeled source cameras. We select 8 cameras located at different intersections of the city with different scenes, each with more than 2,000 labeled images, for our evaluations. Among these 8 cameras, we randomly pick two and take each in turn as the target camera, with the other 7 cameras as sources. We compute the proxy A-distance (PAD) [7] between each source camera and the target camera to approximate the divergence between them. We then rank the source cameras by PAD from low to high and choose the first k cameras to form the k source domains, so that the proposed methods and baselines can be evaluated on different numbers of sources (from 2 to 7). We implement Hard-Max and Soft-Max MDAN based on the basic vehicle counting network FCN [52]. We compare our method with two baselines: FCN [52], a basic network without domain adaptation, and DANN [17], implemented

Table 3: Counting error statistics.
S is the number of source cameras; T is the target camera id.\n\nS\nT\n2 A\n3 A\n4 A\n5 A\n6 A\n7 A\n\nMDAN\n\nHard-Max\n\nSoft-Max\n1.7140\n1.2363\n1.1965\n1.1942\n1.2877\n1.2984\n\nDANN\n1.9490\n1.3683\n1.5520\n1.4156\n2.0298\n1.5426\n\nFCN\n1.9094\n1.5545\n1.5499\n1.7925\n1.7505\n1.7646\n\nT\nB\nB\nB\nB\nB\nB\n\n1.8101\n1.3276\n1.3868\n1.4021\n1.4359\n1.4381\n\nMDAN\n\nHard-Max\n\nSoft-Max\n2.3438\n1.8680\n1.8487\n1.6016\n1.4644\n1.5126\n\nDANN\n2.5218\n2.0122\n2.1856\n1.7228\n1.5484\n1.5397\n\nFCN\n2.6528\n2.4319\n2.2351\n2.0504\n2.2832\n1.7324\n\n2.5059\n1.9092\n1.7375\n1.7758\n1.5912\n1.5989\n\non top of the same basic network. We record mean absolute error (MAE) between true count and\nestimated count.\nResults and Analysis. The counting error of different methods is compared in Table 3. The Hard-\nMax version achieves lower error than DANN and FCN in most settings for both target cameras.\nThe Soft-Max approximation outperforms all the baselines and the Hard-Max in most settings,\ndemonstrating the effectiveness of the smooth and adaptative approximation. The lowest MAE\nachieved by Soft-Max is 1.1942. Such MAE means that there is only around one vehicle miscount\nfor each frame (the average number of vehicles in one frame is around 20). Fig. 2 shows the counting\nresults of Soft-Max for the two target cameras under the 5 source cameras setting. We can see that\nthe proposed method accurately counts the vehicles of each target camera for long time sequences.\nDoes adding more source cameras always help improve the performance on the target camera? To\nanswer this question, we analyze the counting error when we vary the number of source cameras\nas shown in Fig. 3a, where the x-axis refers to number of source cameras and the y-axis includes\nboth the MAE curve on the target camera as well as the PAD distance (bar chart) between the pair of\nsource and target cameras. 
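The proxy A-distance used above to rank the source cameras can be computed by training a classifier to tell the two domains apart: PAD = 2(1 − 2ε), where ε is the held-out error of that domain classifier. Below is a minimal sketch on synthetic features; the nearest-centroid classifier and the Gaussian feature vectors are illustrative stand-ins for the real discriminator and real camera features.

```python
import numpy as np

def proxy_a_distance(source_feats, target_feats, rng):
    """Estimate PAD = 2 * (1 - 2 * err) with a simple nearest-centroid domain classifier."""
    X = np.vstack([source_feats, target_feats])
    y = np.concatenate([np.zeros(len(source_feats)), np.ones(len(target_feats))])
    idx = rng.permutation(len(X))                      # shuffle, then split 50/50
    X, y = X[idx], y[idx]
    half = len(X) // 2
    X_tr, y_tr, X_te, y_te = X[:half], y[:half], X[half:], y[half:]
    c_src = X_tr[y_tr == 0].mean(axis=0)               # per-domain centroids
    c_tgt = X_tr[y_tr == 1].mean(axis=0)
    pred = np.linalg.norm(X_te - c_tgt, axis=1) < np.linalg.norm(X_te - c_src, axis=1)
    err = np.mean(pred != (y_te == 1))                 # held-out domain-classification error
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(500, 16))
near_source = rng.normal(0.2, 1.0, size=(500, 16))     # similar to the target
far_source = rng.normal(3.0, 1.0, size=(500, 16))      # very different from the target

# A more divergent source yields a larger PAD, so sources can be ranked by it.
print(proxy_a_distance(far_source, target, rng) > proxy_a_distance(near_source, target, rng))  # → True
```

A hard-to-separate pair drives ε toward 0.5 and PAD toward 0, while a perfectly separable pair drives ε toward 0 and PAD toward 2; ranking sources by this score mirrors the camera-selection step described above.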
From the curves, we see that the counting error decreases as the first few source cameras are added, but increases again when more distant sources are added at the end. This phenomenon shows that the performance on the target domain also depends on its distance to the added source domain, i.e., it is not always beneficial to naively incorporate more source domains into training. To illustrate this better, we also show the PAD of each newly added camera in the bar chart of Fig. 3a. Comparing the PAD and the counting error, we see that the performance on the target can degrade when the newly added source camera has a large divergence from the target camera. To show that MDAN can indeed decrease the divergence between the target domain and the multiple source domains, in Fig. 3b we plot the PAD between the target domains and the corresponding source domains. MDAN consistently decreases the PAD between all pairs of target and source domains, for both camera A and camera B. From this experiment we conclude that the proposed MDAN models are effective in multiple source domain adaptation.

Figure 2: Counting results for target camera A (first row) and B (second row). X-axis: frames; Y-axis: vehicle counts.

6 Related Work

A number of adaptation approaches have been studied in recent years. On the theoretical side, several results have been derived in the form of upper bounds on the target generalization error when learning from source data. A key point of these theoretical frameworks is estimating the distribution shift between source and target. Kifer et al.
[27] proposed the H-divergence to measure the similarity between two domains and derived a generalization bound on the target domain using the empirical error on the source domain and the H-divergence between the source and the target.

Figure 3: (a) Counting error and PAD over different numbers of source cameras. (b) PAD between source and target domains before and after training MDAN.

This idea was later extended to multisource domain adaptation [9], and a corresponding generalization bound has been developed as well. Ben-David et al. [8] provide a generalization bound on the target risk which generalizes the standard bound on the source risk. This work formalizes a natural intuition of DA, reducing the distance between the two distributions while ensuring a low error on the source domain, and justifies many DA algorithms. Based on this work, Mansour et al. [33] introduce a new divergence measure, the discrepancy distance, whose empirical estimate is based on the Rademacher complexity [28]. See [13, 35] for more details.
Following the theoretical developments, many DA algorithms have been proposed, such as instance-based methods [44], feature-based methods [6], and parameter-based methods [15]. Recent studies have shown that deep neural networks can learn more transferable features for DA [14, 20, 50]. Bousmalis et al. [10] develop domain separation networks to extract image representations that are partitioned into two subspaces: a domain-private component and a cross-domain shared component. The partitioned representation is used to reconstruct the images from both domains, improving DA performance. Ganin et al. [17] propose a domain-adversarial neural network to learn features that are indiscriminate with respect to the domain but discriminative for the main task.
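For reference, the H-divergence between two distributions D_S and D_T, with respect to a hypothesis class H, measures how well the best hypothesis in H can separate samples drawn from the two distributions:

```latex
d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)
  = 2 \sup_{h \in \mathcal{H}}
    \Bigl| \Pr_{x \sim \mathcal{D}_S}\bigl[h(x) = 1\bigr]
         - \Pr_{x \sim \mathcal{D}_T}\bigl[h(x) = 1\bigr] \Bigr|
```

When H has finite VC dimension, this quantity can be estimated from finite unlabeled samples of the two domains; the proxy A-distance used in the experiments above is precisely such a finite-sample approximation.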
Adversarial training techniques that aim to build feature representations that are indistinguishable between source and target domains have been proposed in the last few years [2, 17]. Specifically, one of the central ideas is to use neural networks, which are powerful function approximators, to approximate a distance measure known as the H-divergence between two domains [7, 8, 27]. The overall algorithm can be viewed as a zero-sum two-player game: one network tries to learn feature representations that can fool the other network, whose goal is to distinguish representations generated from the source domain from those generated from the target domain. The goal of the algorithm is to find a Nash equilibrium of the game. Ideally, at such an equilibrium, feature representations from the source domain will share the same distribution as those from the target domain. Although these works generally outperform non-deep-learning-based methods, they only focus on the single-source-single-target DA problem, and much of this work is empirically designed without statistical guarantees. Hoffman et al. [23] present a domain transform mixture model for multisource DA, which is based on non-deep architectures and is difficult to scale up.

7 Conclusion

We theoretically analyze generalization bounds for DA under the setting of multiple source domains with labeled instances and one target domain with unlabeled instances. Specifically, we propose average-case generalization bounds for both classification and regression problems. The new bounds have interesting interpretations, and the one for classification reduces to an existing bound when there is only one source domain. Following our theoretical results, we propose two MDAN variants to learn feature representations that are invariant under multiple domain shifts while at the same time being discriminative for the learning task.
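The two variants differ only in how the per-source-domain adversarial losses are combined: the hard version optimizes the worst (maximum) domain loss, while the soft version replaces the maximum with a smooth log-sum-exp surrogate. A minimal sketch of the two combination rules, assuming a log-sum-exp form with a temperature parameter `gamma` (the parameter name is illustrative):

```python
import math

def hard_max(losses):
    """Hard combination: optimize the worst per-domain loss."""
    return max(losses)

def soft_max(losses, gamma=1.0):
    """Smooth combination: (1/gamma) * log(sum_i exp(gamma * L_i)).

    Subtracting the max before exponentiating keeps the computation
    numerically stable; the result upper-bounds max(losses) and tends
    to it as gamma grows.
    """
    m = max(losses)
    return m + math.log(sum(math.exp(gamma * (l - m)) for l in losses)) / gamma

domain_losses = [0.9, 0.4, 0.7]
print(hard_max(domain_losses))            # → 0.9
print(round(soft_max(domain_losses), 3))  # smooth, slightly larger than the max
```

Because log-sum-exp is differentiable everywhere and adaptively weights each domain by its current loss, it yields a smoother optimization landscape than the hard maximum while still upper-bounding it.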
Both hard and soft versions of MDAN are generalizations of the popular DANN to the case when multiple source domains are available. Empirically, MDAN outperforms state-of-the-art DA methods on three real-world datasets, including a sentiment analysis task, a digit classification task, and a visual vehicle counting task, demonstrating its effectiveness in multisource domain adaptation for both classification and regression problems.

Acknowledgement
HZ and GG gratefully acknowledge support from ONR, award number N000141512365. This research was supported in part by Fundação para a Ciência e a Tecnologia (project FCT [SFRH/BD/113729/2015] and a grant from the Carnegie Mellon-Portugal program). HZ would also like to thank Remi Tachet for his valuable comments and suggestions.

References
[1] Tameem Adel, Han Zhao, and Alexander Wong. Unsupervised domain adaptation with a relaxed covariate shift assumption. In AAAI, pages 1691–1697, 2017.

[2] Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand. Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446, 2014.

[3] Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[4] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.

[5] Mahsa Baktashmotlagh, Mehrtash T Harandi, Brian C Lovell, and Mathieu Salzmann. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776, 2013.

[6] Carlos J Becker, Christos M Christoudias, and Pascal Fua. Non-linear domain adaptation with boosting.
In Advances in Neural Information Processing Systems, pages 485–493, 2013.

[7] Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira, et al. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.

[8] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

[9] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129–136, 2008.

[10] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.

[11] Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683, 2012.

[12] Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

[13] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pages 38–53. Springer, 2008.

[14] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, volume 32, pages 647–655, 2014.

[15] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109–117. ACM, 2004.

[16] Chuang Gan, Tianbao Yang, and Boqing Gong.
Learning attributes equals multi-source domain generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 87–97, 2016.

[17] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

[18] Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML (3), pages 738–746, 2013.

[19] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 2551–2559, 2015.

[20] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 513–520, 2011.

[21] Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML (1), pages 222–230, 2013.

[22] Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2288–2302, 2014.

[23] Judy Hoffman, Brian Kulis, Trevor Darrell, and Kate Saenko. Discovering latent domains for multisource domain adaptation. In Computer Vision–ECCV 2012, pages 702–715. Springer, 2012.

[24] Judy Hoffman, Mehryar Mohri, and Ningshan Zhang. Multiple-source adaptation for regression problems.
arXiv preprint arXiv:1711.05037, 2017.

[25] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. arXiv preprint arXiv:1711.03213, 2017.

[26] I-Hong Jhuo, Dong Liu, DT Lee, and Shih-Fu Chang. Robust visual domain adaptation with low-rank reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2168–2175. IEEE, 2012.

[27] Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 180–191. VLDB Endowment, 2004.

[28] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.

[29] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[30] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, pages 97–105, 2015.

[31] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.

[32] Yishay Mansour and Mariano Schain. Robust domain adaptation. In ISAIM, 2012.

[33] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.

[34] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems, pages 1041–1048, 2009.

[35] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation and the Rényi divergence.
In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 367–374. AUAI Press, 2009.

[36] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[37] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.

[38] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[39] Zhongyi Pei, Zhangjie Cao, Mingsheng Long, and Jianmin Wang. Multi-adversarial domain adaptation. 2018.

[40] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[41] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. Learning from simulated and unsupervised images through adversarial training. arXiv preprint arXiv:1612.07828, 2016.

[42] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A DIRT-T approach to unsupervised domain adaptation. arXiv preprint arXiv:1802.08735, 2018.

[43] Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, and Jieping Ye. A two-stage weighting framework for multi-source domain adaptation. In Advances in Neural Information Processing Systems, pages 505–513, 2011.

[44] Yuta Tsuboi, Hisashi Kashima, Shohei Hido, Steffen Bickel, and Masashi Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.

[45] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko.
Simultaneous deep transfer across domains and tasks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4068–4076, 2015.

[46] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. arXiv preprint arXiv:1702.05464, 2017.

[47] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.

[48] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

[49] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.

[50] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.

[51] Kun Zhang, Mingming Gong, and Bernhard Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157, 2015.

[52] Shanghang Zhang, Guanhang Wu, Joao P Costeira, and Jose MF Moura. Understanding traffic density from large-scale web camera data. arXiv preprint arXiv:1703.05868, 2017.