{"title": "Extracting Relationships by Multi-Domain Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 6798, "page_last": 6809, "abstract": "In many biological and medical contexts, we construct a large labeled corpus by aggregating many sources to use in target prediction tasks.  Unfortunately, many of the sources may be irrelevant to our target task, so ignoring the structure of the dataset is detrimental.  This work proposes a novel approach, the Multiple Domain Matching Network (MDMN), to exploit this structure. MDMN embeds all data into a shared feature space while learning which domains share strong statistical relationships. These relationships are often insightful in their own right, and they allow domains to share strength without interference from irrelevant data. This methodology builds on existing distribution-matching approaches by assuming that source domains are varied and outcomes multi-factorial. Therefore, each domain should only match a relevant subset. Theoretical analysis shows that the proposed approach can have a tighter generalization bound than existing multiple-domain adaptation approaches.  Empirically, we show that the proposed methodology handles higher numbers of source domains (up to 21 empirically), and provides state-of-the-art performance on image, text, and multi-channel time series classification, including clinically relevant data of a novel treatment of Autism Spectrum Disorder.", "full_text": "Extracting Relationships by Multi-Domain Matching\n\nYitong Li1, Michael Murias2, Samantha Major3, Geraldine Dawson3 and David E. 
Carlson1,4,5\n\n1Department of Electrical and Computer Engineering, Duke University\n\n2Duke Institute for Brain Sciences, Duke University\n\n3Departments of Psychiatry and Behavioral Sciences, Duke University\n4Department of Civil and Environmental Engineering, Duke University\n\n5Department of Biostatistics and Bioinformatics, Duke University\n\n{yitong.li,michael.murias,samantha.major,\ngeraldine.dawson,david.carlson}@duke.edu\n\nAbstract\n\nIn many biological and medical contexts, we construct a large labeled corpus by\naggregating many sources to use in target prediction tasks. Unfortunately, many\nof the sources may be irrelevant to our target task, so ignoring the structure of\nthe dataset is detrimental. This work proposes a novel approach, the Multiple\nDomain Matching Network (MDMN), to exploit this structure. MDMN embeds\nall data into a shared feature space while learning which domains share strong\nstatistical relationships. These relationships are often insightful in their own right,\nand they allow domains to share strength without interference from irrelevant\ndata. This methodology builds on existing distribution-matching approaches by\nassuming that source domains are varied and outcomes multi-factorial. Therefore,\neach domain should only match a relevant subset. Theoretical analysis shows\nthat the proposed approach can have a tighter generalization bound than existing\nmultiple-domain adaptation approaches. Empirically, we show that the proposed\nmethodology handles higher numbers of source domains (up to 21 empirically), and\nprovides state-of-the-art performance on image, text, and multi-channel time series\nclassi\ufb01cation, including clinical outcome data in an open label trial evaluating a\nnovel treatment for Autism Spectrum Disorder.\n\n1\n\nIntroduction\n\nDeep learning methods have shown unparalleled performance when trained on vast amounts of\ndiverse labeled training data [21], often collected at great cost. 
In many contexts, especially medical and biological, it is prohibitively expensive to collect or label the number of observations necessary to train an accurate deep neural network classifier. However, a number of related sources, each with "moderate" data, may already be available, which can be combined to construct a large corpus. Naively using the combined source data is often an ineffective strategy; instead, what is needed is unsupervised multiple-domain adaptation. Given labeled data from several source domains (each representing, e.g., one patient in a medical trial, or reviews of one type of product), and unlabeled data from target domains (new patients, or new product categories), we wish to train a classifier that makes accurate predictions about the target domain data at test time.
Recent approaches to multiple-domain adaptation involve learning a mapping from each domain into a common feature space, in which observations from the target and source domains have similar distributions [14, 45, 39, 30]. At test time, a target-domain observation is first mapped into this shared feature space, then classified. However, few of the existing works can model the relationships among different domains, which we note are important for several reasons. First, even though data in different domains share labels, their causes and symptoms may differ. Patients with the same condition can have various underlying causes and be diagnosed while sharing only a subset of symptoms. Extracting these relationships between patients is helpful in practice because it limits the model to only relevant information.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

(a) Baseline model  (b) Proposed model MDMN
Figure 1: Figure 1(a) visualizes previous multiple-domain adaptation methods. Figure 1(b) visualizes the proposed method, with domain adaptation between all domains.
Second, as mentioned above, a training corpus may be constructed from only a small number of sources within a larger population. For example, we might collect data from many patients, each with "small" data, and use domain adaptation to generalize to new patients [3]. Therefore, extracting these relationships is of practical importance.
In addition to the practical argument, [32] gives a theoretical proof that adding irrelevant source domains harms performance bounds on multiple-domain adaptation. Therefore, it is necessary to automatically choose a weighting over source domains so that only relevant domains are utilized. Only a few works address such a domain weighting strategy [45]. In this manuscript, we extend the proof techniques of [4, 32] to show that a multiple-domain weighting strategy can have a tighter generalization bound than traditional multiple-domain approaches.
Notably, many recently proposed transfer learning strategies are based on minimizing the H-divergence between domains in feature space, which was shown to bound generalization error in domain adaptation [4]. Compared to the standard L1-divergence, the H-divergence limits the hypothesis to a given class, which in theory can be better estimated from finite samples. The target error bound using the H-divergence has the desirable property that it can be estimated by learning a classifier between the source and target domains with finite VC dimension, motivating the Domain Adversarial Neural Network (DANN) [14]. However, neural networks usually have large VC dimension, making the H-divergence bound loose in practice. In this work, we propose to use a 'Wasserstein-like' metric to define domain similarity in the proofs. The 'Wasserstein-like' distance in our work extends the binary output of the H-divergence to a real-valued probabilistic output.
Our main contribution is our novel approach to multiple-domain adaptation.
A key idea from prior work is to match every source domain's feature-space distribution to that of the target domain [37, 29]. In contrast, we match distributions both (i) between the sources and the target and (ii) within the source domains. It is only necessary, and prudent, to match each domain to a relevant subset of the others. This makes sense particularly in medical contexts, as nearly all diagnoses address a multi-factorial disease. The Wasserstein distance is chosen to facilitate the mathematical and theoretical operations of pairwise matching over multiple domains. The underlying idea is also closely related to optimal transport for domain adaptation [7, 8], but addresses multiple-domain matching.
The proposed method, MDMN, is visualized in Figure 1(b) and compared with the standard source-to-target matching scheme (Figure 1(a)), showing the matching of source domains. This tweak allows already-similar domains to merge and share statistical strength, while keeping distant clusters of domains separate from one another. At test time, only the domains most relevant to the target are used [5, 32]. In essence, this induces a potentially sparse graph on all domains, which is visualized for 22 patients from one of our experiments in Figure 2.

Figure 2: Figure 2 is a visualization of the graph induced on 22 patients by the proposed model, MDMN. Each node represents one subject and the target domain is shown in blue. Note that although the target is only strongly connected to one source domain, the links between source domains allow them to share strength and make more robust predictions. The lines are labeled by the mean of the directional weights learned in MDMN.
Any neural network architecture can be modified to use MDMN, which can be considered a stand-alone domain-matching module.

2 Method

The Multiple Domain Matching Network (MDMN) is based upon the intuition that, in the extracted feature space, inherently similar domains should have similar or identical distributions. By sharing strength within the source domains, MDMN can better deal with overfitting within each domain, a common problem in scientific settings. Meanwhile, the relationships between domains can also be learned, which is of interest in addition to classification performance.
In the following, suppose we are given N observations {(x_i, y_i, s_i)}_{i=1}^N from S domains, where y_i is the desired label for x_i and s_i is the domain. (In the target domain, the label y is not provided and will instead be predicted.) For brevity, we assume domains 1, 2, ..., S−1 are the source domains and the S-th domain is the single target domain. In fact, our approach works analogously for any number of unlabeled target domains.
The whole framework, shown in Figure 3, is composed of a feature extractor (or encoder), a domain adapter (Sec. 2.1), and a label classifier (Sec. 2.2). In this work, we instantiate all three as neural networks. The encoder E maps data points x to feature vectors E(x). These features are then used by the label classifier to make predictions for the supervised task. They are also used by the domain adapter, which encourages the extracted features E(x) to be similar across nearby domains.

Figure 3: The framework of MDMN.

2.1 Domain Adaptation with Relationship Extraction

This section details the structure of the domain adapter. In order to adapt one domain to the others, one approach is to consider a penalty proportional to the distance between each distribution and the weighted mean of the rest. Specifically, let P_s be the distribution over data points x in domain D_s, and let P_{/s}^{w_s} = Σ_{s'=1}^S w_{ss'} P_{s'} be the distribution of data from all other domains, D_{/s}^{w_s}.
Note that the weight w_s = [w_{s1}, ..., w_{sS}] is domain specific, with w_s ∈ R^S lying on the simplex: ||w_s||_1 = 1, w_{ss'} ≥ 0 for s' = 1, ..., S, and w_{ss} = 0. These weights will be learned within the framework. In the following, we will use D_s to stand for its distribution P_s in order to simplify notation. We can then encourage all domains to be close together in the feature space by adding the following term to the loss:

L_D(E(x; θ_E); θ_D) = Σ_{s=1}^S β_s d(D_s, D_{/s}^{w_s}),   (1)

where d(·,·) is a distance between distributions (domains). Here it is used to measure the discrepancy between one domain and a weighted average of the rest. We assume the weight β_s equals 1/(S−1) for s = 1, ..., S−1 and β_S = 1, to balance the penalty between the source and target domains, although this may be chosen as a tuning parameter. L_D is the total domain adapter loss function.
For the rest of this manuscript, we have chosen the Wasserstein distance as d(·,·). This approach is facilitated by the Kantorovich-Rubinstein dual formulation of the Wasserstein-1 distance [2], which is given for distributions D_1 and D_2 as

d(D_1, D_2) = sup_{||f||_L ≤ 1} E_{x∼P_1}[f(E(x))] − E_{x∼P_2}[f(E(x))],

where ||f||_L ≤ 1 denotes that the Lipschitz constant of the function f(·) is at most 1, i.e. |f(x') − f(x)| ≤ ||x' − x||_2. Here f(·) is any Lipschitz-smooth nonlinear function, which can be approximated by a neural network [2]. When S is reasonably small (< 100), it is feasible to include S small neural networks f_s(·; θ_D) to approximate these distances for each domain. In our implementation, we use shared layers in the domain adapter to enhance computational efficiency, and the output of the domain adapter is f(·; θ_D) = [f_1, ..., f_s, ..., f_S].
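To make the three modules concrete, here is a minimal NumPy sketch of the framework (our own illustration, not the authors' released code; the single-layer networks, dimensions, and initializations are all assumptions made for brevity): an encoder E, a shared-bottom domain adapter f with one scalar head per domain, and a label classifier Y.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(d_in, d_out):
    # Toy single-layer weights; a real model would use deeper networks.
    return rng.normal(scale=0.1, size=(d_in, d_out))

D, H, S, C = 8, 16, 4, 3        # input dim, feature dim, #domains, #classes
W_enc = mlp_layer(D, H)          # encoder E
W_critic = mlp_layer(H, S)       # shared-bottom critic with S heads f_1..f_S
W_clf = mlp_layer(H, C)          # label classifier Y

def E(x):
    """Feature extractor: maps data points to feature vectors E(x)."""
    return np.tanh(x @ W_enc)

def f(feat):
    """Domain adapter: one real-valued output per domain, f = [f_1..f_S]."""
    return feat @ W_critic       # shape (N, S)

def Y(feat):
    """Label classifier: class probabilities via a softmax output."""
    z = feat @ W_clf
    z -= z.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)
```

The key structural point is that both the classifier and the domain adapter consume the same features E(x), so gradients from both losses shape the encoder.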
The domain loss term is then given as

Σ_{s=1}^S sup_{||f_s||_L ≤ 1} β_s ( E_{x∼D_s}[f_s(E(x))] − E_{x∼D_{/s}^{w_s}}[f_s(E(x))] ).   (2)

To make the domain penalty in (2) feasible, it is necessary to discuss how the penalty can be included in the optimization flow of neural network training. To develop this mathematical approach, let π_s be the proportion of the data that comes from the s-th domain; then the penalty can be rewritten as

(1/S) Σ_{s=1}^S β_s ( E_{x∼D_s}[f_s(E(x))] − E_{x∼D_{/s}^{w_s}}[f_s(E(x))] )
  = E_{s∼Uniform(1,...,S)}[ E_{x∼D_s}[ r_s^T f(E(x)) ] ]
  = E_{s∼π}[ E_{x∼D_s}[ (1/(S π_s)) r_s^T f(E(x)) ] ],   (3)

where f(E(x)) is the concatenation of the f_s(E(x)), i.e. f(E(x)) = [f_1(E(x)), ..., f_S(E(x))]^T. The vector r_s ∈ R^S is defined as

r_{ss'} = −β_{s'} w_{s's} for s' ≠ s, and r_{ss} = β_s,  s' = 1, ..., S.   (4)

The form in (3) is natural to include in an optimization loop because the expectation is empirically approximated by a mini-batch of data. Let {(x_i, s_i)}, i = 1, ..., N denote observations and their associated domains s_i; then

E_{s∼π}[ E_{x∼D_s}[ (1/(S π_s)) r_s^T f(E(x)) ] ] ≈ (1/(SN)) Σ_{i=1}^N π_{s_i}^{−1} r_{s_i}^T f(E(x_i; θ_E); θ_D).   (5)

The weight vector w_s for domain D_s should focus only on relevant domains, and the weights on mismatched domains should be very small. As noted previously, adding uncorrelated domains hurts generalization performance [32]. In our Theorem 3.3, we show that a weighting scheme with these properties decreases the target error bound. Once the functions f_s(·; θ_D) are known, we can estimate w_s by applying a softmax transformation to the expected differences of f_s between any two domains.
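The mini-batch estimator in Eq. (5) can be sketched directly in NumPy (our own sketch under assumed shapes and names; `critic_out` stands in for the stacked critic outputs f(E(x_i))):

```python
import numpy as np

def domain_loss(critic_out, domains, w, beta):
    """Mini-batch estimate of the domain loss, Eq. (5).

    critic_out : (N, S) array, critic_out[i, s] = f_s(E(x_i))
    domains    : (N,) int array of domain indices s_i
    w          : (S, S) matching weights, w[s, s] == 0, rows on the simplex
    beta       : (S,) domain weights beta_s
    Assumes every domain appears at least once in the batch.
    """
    N, S = critic_out.shape
    # r_s from Eq. (4): +beta_s on the diagonal, -beta_{s'} w_{s's} off it
    r = -(beta[:, None] * w).T          # r[s, s'] = -beta_{s'} * w[s', s]
    r[np.arange(S), np.arange(S)] = beta
    # empirical domain proportions pi_s
    pi = np.bincount(domains, minlength=S) / N
    # (1 / (S N)) * sum_i  pi_{s_i}^{-1}  r_{s_i}^T f(E(x_i))
    per_sample = np.einsum('ij,ij->i', r[domains], critic_out)
    return per_sample @ (1.0 / pi[domains]) / (S * N)
```

With empirical distributions, this expression is exactly the plug-in estimate of (1/S) times the penalty in Eq. (2), which is what makes the per-sample form convenient for mini-batch training.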
Specifically, the weight w_s to match D_s to the other domains is calculated as

w_s = softmax_{/s}(−κ l_s),  with  l_{ss'} = E_{x∼D_s}[f_s(E(x))] − E_{x∼D_{s'}}[f_s(E(x))],   (6)

where l_s = [l_{s1}, ..., l_{ss'}, ..., l_{sS}]. The subscript /s means that the value w_{ss} is restricted to 0 and l_{ss} is excluded from the softmax. The scalar quantity κ controls how peaked w_s is. Note that setting w_s in (2) to 1 for the closest domain and 0 otherwise would correspond to the κ → ∞ case, and κ → 0 corresponds to an unweighted (e.g. conventional) case. It is beneficial to force the domain regularizer to match multiple, but not necessarily all, available domains. Practically, we can either modify κ in the softmax or change the Lipschitz constant used to calculate the distance (as was done here). As an example, the learned graph connectivity shown in Figure 2 is constructed by thresholding (w_{ss'} + w_{s's})/2 to determine connectivity between nodes.

2.2 Combining the Loss Terms

The proposed method uses the loss in (5) to perform the domain matching. A label classifier is also necessary, which is defined as a neural network parameterized by θ_Y. The label classifier in Figure 3 is represented as Y[E(x)], where the classifier Y is applied to the extracted feature vector E(x). The label predictor usually contains several fully connected layers with nonlinear activation functions. The cross-entropy loss is used for classification, i.e.

L_Y(x, y; θ_Y, θ_E) = −Σ_{i=1}^N Σ_{c=1}^C y_{ic} log Y_c[E(x_i)],

where Y_c denotes the c-th entry of the output. The MSE loss is used for regression.
With the label prediction loss L_Y, the complete network loss is given by

min_{θ_E, θ_Y} max_{θ_D} L_Y(θ_Y, θ_E) + ρ L_D(θ_D, θ_E),   (7)

where θ_E denotes the parameters of the feature extractor/encoder, θ_D the parameters of the domain adapter, and θ_Y those of the label classifier.
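The weighting step of Eq. (6) above can be sketched as follows (our own illustration; the sign convention, i.e. applying the softmax to −κ l_s so that closer domains receive larger weight, follows the κ → ∞ / κ → 0 limits described in the text):

```python
import numpy as np

def matching_weights(mean_f, kappa=1.0):
    """Domain matching weights of Eq. (6).

    mean_f : (S, S) array, mean_f[s, s'] = E_{x ~ D_{s'}}[f_s(E(x))],
             estimated from a mini-batch.
    Returns w with w[s, s'] large when D_{s'} looks close to D_s under
    critic f_s, w[s, s] = 0, and each row summing to 1.
    """
    # l_{ss'} = E_{D_s}[f_s] - E_{D_{s'}}[f_s]
    l = np.diag(mean_f)[:, None] - mean_f
    logits = -kappa * l                          # closer domain -> higher weight
    np.fill_diagonal(logits, -np.inf)            # exclude l_{ss} from the softmax
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

As κ grows, each row of w concentrates on the single closest domain; as κ → 0, the off-diagonal weights become uniform, recovering the conventional unweighted case.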
The pseudo-code for training is given in Algorithm 1.

Algorithm 1 Multiple Source Domain Adaptation via WDA
Input: Source samples from D_s, s = 1, ..., S−1, and target samples from D_S. (Note that we assume indices 1, ..., S−1 are the source domains and S is the target domain.) Iteration counts k_Y and k_D for training the label classifier and the domain discriminator; step size γ.
Output: Classifier parameters θ_E, θ_Y, θ_D.

for iter = 1 to max_iter do
  Sample a mini-batch {x_s} from {D_s}_{s=1}^{S−1} and {x_t} from D_S.
  for iter_Y = 1 to k_Y do
    Compute l_{ss'} = E_{x∈D_s}[f_s(E(x))] − E_{x∈D_{s'}}[f_s(E(x))] for all s, s' ∈ [1, S].
    Compute the weight vectors w_s = softmax_{/s}(−κ l_s) with w_{ss} = 0 for all s ∈ [1, S], where l_s = (l_{s1}, ..., l_{sS}).
    Compute the domain loss L_D(x_s, x_t) and the classifier loss L_Y(x_s).
    Compute ∇θ_Y = ∂L_Y/∂θ_Y and ∇θ_E = ∂L_Y/∂θ_E + ρ ∂L_D/∂θ_E.
    Update θ_Y ← θ_Y − γ ∇θ_Y and θ_E ← θ_E − γ ∇θ_E.
  end for
  for iter_D = 1 to k_D do
    Update the weight vectors w_s for all s ∈ [1, S].
    Compute L_D(x_s, x_t) and ∇θ_D = ∂L_D/∂θ_D.
    Update θ_D ← θ_D + γ ∇θ_D.
  end for
end for

During training, the target-domain weight β_S in eq. (1) is always set to one, while the source-domain weights are normalized to sum to one, because the ultimate goal is to perform well on the target domain. We use the gradient penalty introduced in [18] to implement the Lipschitz constraint. A concern is that the feature scale may change and impact the Wasserstein distance. One potential solution is to include batch normalization to keep the summary statistics of the extracted features constant. In practice, this is not necessary.
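The pairwise statistics that drive the weight updates in Algorithm 1 can be tracked with a small helper; below is a sketch under assumed shapes and names of the exponential smoothing described in Sec. 2.3 (the 0.9/0.1 running update of the l_{ss'} estimates):

```python
import numpy as np

class PairwiseDistanceTracker:
    """Exponentially smoothed estimates of l_{ss'} (Sec. 2.3).

    Maintains l^{t+1} = 0.9 * l^t + 0.1 * l_hat, where l_hat is the
    current mini-batch estimate of Eq. (8). These estimates feed the
    softmax weights and are kept out of backpropagation.
    """
    def __init__(self, n_domains, decay=0.9):
        self.l = np.zeros((n_domains, n_domains))
        self.decay = decay

    def update(self, critic_out, domains):
        # critic_out: (N, S) with critic_out[i, s] = f_s(E(x_i)).
        # Assumes every domain appears in the mini-batch.
        S = self.l.shape[0]
        # mean_f[s, s'] = average of f_s over the samples of domain s'
        mean_f = np.stack([critic_out[domains == sp].mean(axis=0)
                           for sp in range(S)], axis=1)       # (S, S)
        l_hat = np.diag(mean_f)[:, None] - mean_f             # Eq. (8)
        self.l = self.decay * self.l + (1.0 - self.decay) * l_hat
        return self.l
```

Keeping these estimates as smoothed running statistics (rather than differentiable quantities) matches the alternating scheme of Algorithm 1, where the weights and the network parameters are updated in separate steps.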
Adam [20] is used as the optimization method, while the gradient descent step in Algorithm 1 reflects the basic strategy.

2.3 Complexity Analysis

Although the proposed algorithm computes pairwise domain distances, the computational cost in practice is similar to that of the standard DANN model.
For the domain loss functions, we share all bottom layers across domains. This is similar to the setup of a multi-class domain classifier with a softmax output, except that in our model the output is a real number. Specifically, the pairwise distance (6) is updated in each mini-batch by averaging samples from the same domain:

l̂_{ss'} ≈ (1/n_s) Σ_{x_i ∈ D_s} f_s(E(x_i)) − (1/n_{s'}) Σ_{x_i ∈ D_{s'}} f_s(E(x_i)).   (8)

Because these pairwise calculations happen late in the network, their computational cost is dwarfed by feature generation. We believe the method will easily scale to hundreds of domains based on its computational and memory scaling. We use exponential smoothing during the updates to improve the quality of the estimates, with l_{ss'}^{t+1} = 0.9 l_{ss'}^t + 0.1 l̂_{ss'}^c, where l̂_{ss'}^c is the value from the current iteration's mini-batch. The softmax is then applied to the smoothed values to get the weights w_{ss'}. This procedure is used to update w_s, so those parameters are not included in the backpropagation. The domain weights and network parameters are updated iteratively, as shown in Algorithm 1.

3 Theoretical Results

In this section, we investigate the theorems and derivations used to bound the target error of the method in Section 2. Specifically, the target error is bounded by the source error and the source-target distance, plus additional terms that are constant under certain data and hypothesis classes. The theory builds on prior analyses of source-to-target adaptation; the adaptation within source domains can be developed in the same way.
Additional details and derivations are available in Supplemental Section A.
Let D_s for s = 1, ..., S and D_T represent the source and target domains, respectively. Note a change of notation for the target domain: the S-th domain denoted the target in the previous section, but here it is easier to separate the target domain out. Suppose there are probabilistic true labeling functions g_s, g_T : X → [0, 1] and a probabilistic hypothesis f : X → [0, 1], which in our case is a neural network. The output value of the labeling function determines the probability that the sample is 0 or 1. g_s and g_T are assumed Lipschitz smooth with parameters λ_s and λ_T, respectively. This differs from the previous derivation [14], which assumed the hypothesis and labeling functions were deterministic ({0, 1}). In the following, the notation of the encoder E(·) is dropped for simplicity; thus f(x) is actually f(E(x; θ_E); θ_D). Since we first focus only on the adaptation from source to target, the output of f(·) in this section is a scalar (the last element of f(·)). The same holds for the notation w_s, which is the domain similarity between D_s and the target.
Definition 3.1 (Probabilistic Classifier Discrepancy). The probabilistic classifier discrepancy for domain D_s is defined as

ε_s(f, g) = E_{x∼D_s}[|f(x) − g(x)|].   (9)

Note that if the label hypothesis is limited to {0, 1}, this is the classification error. In order to construct our main theorem, we use the notation ||f||_L ≤ λ to denote a λ-smooth function; mathematical details are given in Definition A.6 in the appendix. Next we define the weighted Wasserstein-like quantity between the sources and the target.
Definition 3.2 (Weighted Wasserstein-like quantity).
Given S multi-source probability distributions P_s, s = 1, ..., S, and P_T for the target domain, the difference between the weighted source domains {D_s}_{s=1}^S and the target domain D_T is described as

W_α^λ(D_T, Σ_s w_s D_s) = max_{f : X → [0,1], ||f||_L ≤ λ} E_{x∼D_T}[f(x)] − E_{x∼Σ_s w_s D_s}[f(x)].   (10)

Note that if the bound on the function from 0 to 1 is removed, this quantity is the Kantorovich-Rubinstein dual form of the Wasserstein-1 distance. As λ → ∞, it becomes the commonly used L1-divergence or variation divergence [4]. Thus, we could derive this theorem with the H-divergence exactly, but prefer the smoothness constraint to match the Wasserstein distance used. We also define f* as an optimal hypothesis that achieves the minimum discrepancy γ*, given in Appendix A.3. Now we come to the main theorem of this work.
Theorem 3.3 (Bound on weighted multi-source discrepancy). For a hypothesis f : X → [0, 1],

ε_T(f, g_T) ≤ Σ_{s=1}^S w_s ε_s(f, g_s) + W_α^{λ_T + λ*}(Σ_{s=1}^S w_s D_s, D_T) + γ*.   (11)

The quantity γ*, defined in (27) in the appendix, addresses the fundamental mismatch in the true labeling functions, which is uncontrollable by domain adaptation. Note that a weighted sum of Lipschitz continuous functions is also Lipschitz continuous; λ* = Σ_{s=1}^S w_s λ_s is the Lipschitz constant of the weighted domain combination, where the labeling function of domain D_s has Lipschitz constant λ_s. We note that Theorem 3.3 depends on the weighted sum over the source domains, implying that increasing the weight on an irrelevant source domain may hurt generalization performance. This matches the existing literature.
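As an elementary check (our own remark, not part of the paper's proof) of why a weighting can only tighten a max-style source-error term: for any simplex weights,

```latex
\sum_{s=1}^{S} w_s\,\epsilon_s(f, g_s)
  \;\le\; \Big(\sum_{s=1}^{S} w_s\Big)\,\max_{s}\,\epsilon_s(f, g_s)
  \;=\; \max_{s}\,\epsilon_s(f, g_s),
\qquad w_s \ge 0,\;\; \sum_{s=1}^{S} w_s = 1,
```

so placing small weight on an irrelevant domain with large ε_s directly shrinks the first term of the bound in (11), whereas a max-based bound is dominated by the worst domain regardless.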
Second, a complex model with high learning capacity will reduce the source error ε_s(f, g_s), but the uncertainty introduced by the model will increase the domain discrepancy measurement W_α^{λ+λ*}({D_s}_{s=1}^S, D_T).
Compare with the multi-source domain adversarial network's (MDAN's) [45] bound, ε_T(f, g_T) ≤ max_s ε_s(f, g_s) + max_s d_{H∆H}(D_s, D_T) + γ*, where the definition of d_{H∆H} is given in Appendix Section A.2. Theorem 3.3 reveals that weighting yields a tighter bound, because an irrelevant domain with little weight will not seriously hurt the generalization bound, whereas prior approaches take the max over the least relevant domain. The inner-domain matching also helps prevent spurious relationships between irrelevant domains and the target. Therefore, MDMN can pick out more relevant source domains than the alternative methods evaluated.

4 Related Work

Domain adaptation, which transfers source distribution information to the target distribution or vice versa, has a long history and has been approached in a variety of manners. Kernel Mean Matching (KMM) is widely used under the assumption that target data can be represented by a weighted combination of samples in the source domain [37, 19, 12, 29, 40]. Clustering [25] and late fusion [1] approaches have also been evaluated. Distribution matching has been explored with the Maximum Mean Discrepancy [29] and optimal transport [8, 7], which is similar to the motivation behind our domain penalization.
With the increasing use of neural networks, weight sharing and transfer have emerged as effective strategies for domain adaptation [15]. With the development of Generative Adversarial Networks (GANs) [17], adversarial domain adaptation has become popular. The Domain Adversarial Neural Network (DANN) is a recently proposed model for feature adaptation rather than simple network weight sharing [14].
Since its publication, the DANN approach has been generalized [39, 43] and extended to multiple domains [45]; in the multiple-domain case, a weighted combination of source domains is used for adaptation. [22] builds on the DANN framework but uses distributional summary statistics in the adversary. Several other methods use source- or target-sample generation with GANs for single-source domain adaptation [35, 27, 26, 33], but extensions to multi-source domains are not straightforward. [3] provides a multi-stage, multi-source domain adaptation approach.
There has also been theoretical analysis of error bounds for multi-source domain adaptation. [9] analyzes distribution-weighted combinations of multiple source domains. [32] gives a bound on the target loss when only the k nearest source domains are used, showing that adding training data from uncorrelated source domains hurts the generalization bound. The bound given in [4] is also on the target risk; it introduces the H-divergence as a measure of the distance between source and target domains. [5] further analyzes whether source sample quantity can compensate for quality under different methods and different target error measurements.
Domain adaptation can be used in a wide variety of applications. [16, 10] use it for natural language processing tasks. [12] performs video concept detection using multi-source domain adaptation with auxiliary classifiers. [15, 14, 1, 3, 39] focus on image-domain transfer learning. Multi-source domain adaptation in previous works is usually limited to fewer than five source domains; some scientific applications face a more challenging situation, adapting from a significantly higher number of source domains [44].
For neural signals, different methods have been employed to transfer among subjects based on hand-crafted EEG features [38, 24]; however, these models need to be trained in several steps, making them less robust.

5 Experiment

We tested MDMN by applying it to three classification problems: image recognition, natural-language sentiment detection, and multi-channel time series analysis. The sentiment classification task is given in the Appendix due to limited space.

5.1 Results on Image Datasets

We first test the performance of the proposed MDMN model on the MNIST, MNISTM, SVHN, and USPS datasets. Visualizations of these datasets are given in Appendix Section C.1. In each test, one dataset is left out as the target domain while the remaining three are treated as source domains. The feature extractor E consists of two convolutional layers plus two fully connected layers. Both the label predictor and the domain adapter are two-layer MLPs, with ReLU nonlinearities between layers. The baseline method is the concatenation of the feature extractor and label predictor as a standard CNN; it has no access to any target-domain data during training.
While the TCA [34] and SA [13] methods can process raw images, their results are significantly stronger following a feature extraction step, so their results are obtained in two independent steps. First, a convolutional neural network with the same structure as in our proposed approach is trained on the source domains as a baseline; features are then extracted for all domains and used as inputs to TCA and SA. Another issue is the computational complexity of TCA: the algorithm computes a matrix inverse during inference, which is of complexity O(N^3), so the data was limited for this algorithm.
For the adversarial-based algorithms [39, 14, 45] and the MDMN model, the domain classifier is uniform: a two-layer MLP with ReLU nonlinearities and a softmax top layer.

(a) Baseline  (b) DANN  (c) MDANs  (d) MDMN
Figure 4: Visualization of the feature spaces of different models by t-SNE. Each color represents one dataset of MNIST, MNISTM, SVHN, and USPS. The testing target domain is MNISTM. The digit label is shown in the plot. The goal is to adapt generalized features from the source domains to the target domain; the digits should cluster together, rather than clustering by color.

(a) Classification accuracy on SEED dataset.  (b) Relative classification accuracy on ASD dataset.
Figure 5: Relative classification accuracy by subject on two EEG datasets. The accuracy without subtracting the baseline performance is given in Appendix C.2.

The classification accuracy is compared in Table 1. The top row shows the baseline result on the target domain with the classifier trained on the three other datasets. The proposed model MDMN outperforms the other baselines on all datasets. Note that some domain-adaptation algorithms actually lower the accuracy, revealing that domain-specific features are being transferred. Another problem encountered is the mismatch between the source domains and the target domain. For instance, when the target domain is the MNIST-M dataset, it is expected that large weight is given to MNIST samples during training. However, algorithms like TCA, SA, and DANN weight all source domain datasets equally, making their results worse than MDMN's.

Table 1: Accuracy on image classification. For the TCA method, 20% of the data was randomly selected.

Acc. %      | MNIST | MNISTM | USPS | SVHN
Baseline    | 94.6  | 60.8   | 89.4 | 43.7
TCA [34]    | 78.4  | 45.2   | 75.4 | 39.7
SA [13]     | 90.8  | 59.9   | 86.3 | 40.2
DAN [28]    | 97.1  | 67.0   | 90.4 | 51.9
ADDA [39]   | 89.0  | 80.3   | 85.2 | 43.5
DANN [14]   | 97.9  | 68.8   | 89.3 | 50.1
MDANs [45]  | 97.2  | 68.5   | 90.1 | 50.5
MDMN        | 98.0  | 83.8   | 94.5 | 53.1

If we project the feature vector of each data point to two dimensions using the t-SNE embedding [31], we obtain the features shown in Figure 4. The goal is to mix the different colors while distinguishing the different digits. The baseline model in Figure 4(a) shows no adaptation toward the target domain; e.g., the digit '0' from the USPS and MNIST datasets forms two islands when no domain adaptation is imposed. The DANN and MDANs models show some "mixing" effect, which indicates that domain adaptation is happening because the extracted features are more similar between domains. MDMN has the clearest digit-mixing effect: the model finds digit-label features instead of domain-specific features. A larger figure of the same result is given in Appendix C.1 for enhanced clarity.

5.2 Results on EEG Time Series

Two datasets are used to evaluate performance on electroencephalography (EEG) data: the SEED dataset and an Autism Spectrum Disorder (ASD) dataset.
The SEED dataset [46] focuses on analyzing emotion using EEG signals. The dataset has 15 subjects; the EEG signal was recorded while each subject watched 15 movie clips, three times on three different days. Each video clip is labeled with a negative/neutral/positive emotion. The sampling rate is 1000 Hz and a 62-electrode layout is used. In our experiments, we downsample the EEG signal to 200 Hz. The test scheme is leave-one-out cross-validation.
In each fold, one subject is held out for testing and the remaining 14 subjects are used for training and validation.

The Autism Spectrum Disorder (ASD) dataset [11] aims to discover whether there are significant changes in neural activity in an open-label clinical trial evaluating the efficacy of a single infusion of autologous cord blood as a treatment for ASD. The study involves 22 children, ages 3 to 7 years, undergoing treatment for ASD, with EEG measurements at baseline (T1), 6 months post-treatment (T2), and 12 months post-treatment (T3). The signal was recorded while each child watched three one-minute videos designed to measure responses to dynamic social and nonsocial stimuli. The data has 121 signal electrodes. The classification task is to predict the treatment stage (T1, T2, or T3) to test the effectiveness of the treatment and to analyze which features change in response to it. By examining these features, we can track how neural changes correlate with treatment stage. We also adopt the leave-one-out cross-validation scheme for this dataset: one subject is held out for testing, and the remaining 21 subjects are split into training and validation. Leaving complete subjects out better estimates generalization to a population in these types of neural tasks [42].

The classification accuracy of the different methods is compared in Table 2. In this setting, we choose SyncNet [23] as our baseline model. SyncNet is a neural network with structured filters designed to extract neuroscience-related features. We adopt the simplest SyncNet configuration, which contains only one layer of convolutional filters. As in [23], we set the number of filters to 10 for both datasets. For the TCA, SA, and ITL methods, the baseline model was trained as before, without a domain adapter, on the source domain data.
The resulting model was then used to extract features from the target domains.

MDMN outperforms the other competitors on both EEG datasets. A subject-by-subject plot is shown in Figure 5. Because performance across subjects is highly variable, we only visualize performance relative to the baseline; absolute performance is visualized in Figure 8 in the appendix. Because the source domains are large but each source domain is highly variable, the need to find relevant domains is of increased importance on both EEG datasets. For the ASD dataset, DANN and MDANs do not match the performance of MDMN, mainly because they cannot correctly pick out the most related subjects from the source domains; the same holds for TCA, SA, and ITL. Our proposed algorithm MDMN overcomes this problem by computing domain similarity in feature space while performing the feature mapping, and a domain relationship graph by subject is given in Figure 2. Each subject is related to all the others with different weights; the missing edges, such as the edges to node 's10', are those with weight less than 0.09. Our algorithm automatically finds these relationships, and domain adaptation happens with the calculated weights instead of treating all domains equally.

Table 2: Classification mean accuracy in percentage on EEG datasets.

Dataset        SEED   ASD
SyncNet [23]  62.06  49.29
TCA [34]      55.65  39.70
SA [13]       53.90  62.53
ITL [36]      54.62  45.27
DAN [28]      61.88  50.28
DANN [14]     63.81  55.87
MDANs [45]    63.38  56.65
MDMN          60.59  67.78

6 Conclusion

In this work, we propose the Multiple Domain Matching Network (MDMN), which performs feature matching across different source domains. MDMN uses pairwise domain feature similarity to assign a weight to each training domain, which is of key importance as the number of source domains increases, especially in many neuroscience and biological applications.
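As a toy illustration of this pairwise weighting (a simplified stand-in, not the paper's adversarial implementation), each pair of domains can be scored by the similarity of their mean feature vectors and the scores normalized into per-domain weights; the `domain_weights` helper and its RBF similarity are assumptions for illustration only:

```python
# Sketch: turn pairwise feature-space similarity between domains into
# normalized weights. An RBF kernel on domain means stands in for the
# learned adversarial distance used in the paper.
import numpy as np

def domain_weights(domain_features, bandwidth=2.0):
    """domain_features: list of (n_i, d) feature arrays, one per domain.
    Returns a (k, k) row-stochastic matrix; entry (i, j) is the weight
    that domain i places on domain j."""
    means = np.stack([f.mean(axis=0) for f in domain_features])  # (k, d)
    sq_dists = ((means[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    sim = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(sim, 0.0)                  # no self-weight
    return sim / sim.sum(axis=1, keepdims=True)

# Three toy domains: the first two are statistically close, the third far.
rng = np.random.default_rng(0)
feats = [rng.normal(loc=c, size=(50, 4)) for c in (0.0, 0.2, 5.0)]
W = domain_weights(feats)  # row 0 puts nearly all weight on domain 1
```

Thresholding small entries of such a weight matrix (as done at 0.09 for the subject graph above) recovers a sparse relationship graph of the kind shown in Figure 2.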
While performing domain adaptation, MDMN can also extract the relationships between domains, and the relationship graph is itself of interest in many applications. Our proposed adversarial training framework applies this idea to different domain adaptation tasks and shows state-of-the-art performance.

Acknowledgements

Funding was provided by the Stylli Translational Neuroscience Award, Marcus Foundation, NICHD P50-HD093074, and NIMH 3R01MH099192-05S2.

References

[1] S. Ao, X. Li, and C. X. Ling. Fast generalized distillation for semi-supervised domain adaptation. In AAAI, 2017.

[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[3] J. T. Ash, R. E. Schapire, and B. E. Engelhardt. Unsupervised domain adaptation using approximate label matching. arXiv preprint arXiv:1602.04889, 2016.

[4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 2010.

[5] S. Ben-David and R. Urner. Domain adaptation–can quantity compensate for quality? Annals of Mathematics and Artificial Intelligence, 2014.

[6] M. Chen, Z. Xu, K. Q. Weinberger, and F. Sha. Marginalized stacked denoising autoencoders. In Proceedings of the Learning Workshop, Utah, UT, USA, volume 36, 2012.

[7] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In NIPS, 2017.

[8] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE PAMI, 2017.

[9] K. Crammer, M. Kearns, and J. Wortman. Learning from multiple sources. JMLR, 2008.

[10] H. Daumé III. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815, 2009.

[11] G. Dawson, J. M. Sun, K. S. Davlantis, M. Murias, L. Franz, J. Troy, R. Simmons, M. Sabatos-DeVito, R. Durham, and J. Kurtzberg.
Autologous cord blood infusions are safe and feasible in young children with autism spectrum disorder: Results of a single-center Phase I open-label trial. Stem Cells Translational Medicine, 2017.

[12] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML. ACM, 2009.

[13] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.

[14] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.

[15] T. Gebru, J. Hoffman, and L. Fei-Fei. Fine-grained recognition in the wild: A multi-task domain adaptation approach. In ICCV, 2017.

[16] X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.

[17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

[18] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.

[19] C.-A. Hou, Y.-H. H. Tsai, Y.-R. Yeh, and Y.-C. F. Wang. Unsupervised domain adaptation with label and structural consistency. IEEE Transactions on Image Processing, 2016.

[20] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2014.

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[22] C. Li, D. Alvarez-Melis, K. Xu, S. Jegelka, and S. Sra. Distributional adversarial networks. arXiv preprint arXiv:1706.09549, 2017.

[23] Y. Li, M. Murias, S. Major, G. Dawson, K. Dzirasa, L. Carin, and D. E. Carlson. Targeting EEG/LFP synchrony with neural nets.
In NIPS, 2017.

[24] Y.-P. Lin and T.-P. Jung. Improving EEG-based emotion classification using conditional transfer learning. Frontiers in Human Neuroscience, 2017.

[25] H. Liu, M. Shao, and Y. Fu. Structure-preserved multi-source domain adaptation. In ICDM. IEEE, 2016.

[26] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, 2017.

[27] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In NIPS, 2016.

[28] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2016.

[29] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.

[30] M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.

[31] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 2008.

[32] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2009.

[33] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto. Few-shot adversarial domain adaptation. In NIPS, 2017.

[34] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 2011.

[35] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source to target and back: Symmetric bi-directional adaptive GAN. arXiv preprint arXiv:1705.08824, 2017.

[36] Y. Shi and F. Sha. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. In ICML, 2012.

[37] Q. Sun, R. Chattopadhyay, S. Panchanathan, and J. Ye. A two-stage weighting framework for multi-source domain adaptation. In NIPS, 2011.

[38] W. Tu and S. Sun. A subject transfer framework for EEG classification. Neurocomputing, 2012.

[39] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell.
Adversarial discriminative domain adaptation. In CVPR, 2017.

[40] H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan. Deep hashing network for unsupervised domain adaptation. In CVPR, 2017.

[41] C. Villani. Optimal transport: Old and new. Springer Science & Business Media, 2008.

[42] M.-A. T. Vu, T. Adali, D. Ba, G. Buzsaki, D. Carlson, K. Heller, C. Liston, C. Rudin, V. Sohal, A. S. Widge, et al. A shared vision for machine learning in neuroscience. Journal of Neuroscience, 2018.

[43] Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig. Adversarial invariant feature learning. In NIPS, 2017.

[44] H. Xu, A. Lorbert, P. J. Ramadge, J. S. Guntupalli, and J. V. Haxby. Regularized hyperalignment of multi-set fMRI data. In Statistical Signal Processing Workshop (SSP). IEEE, 2012.

[45] H. Zhao, S. Zhang, G. Wu, J. P. Costeira, J. M. Moura, and G. J. Gordon. Multiple source domain adaptation with adversarial training of neural networks. arXiv preprint arXiv:1705.09684, 2017.

[46] W.-L. Zheng and B.-L. Lu. Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks. IEEE Transactions on Autonomous Mental Development, 2015.