{"title": "A Two-Stage Weighting Framework for Multi-Source Domain Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 505, "page_last": 513, "abstract": "Discriminative learning when training and test data belong to different distributions is a challenging and complex task. Often times we have very few or no labeled data from the test or target distribution but may have plenty of labeled data from multiple related sources with different distributions. The difference in distributions may be in both marginal and conditional probabilities. Most of the existing domain adaptation work focuses on the marginal probability distribution difference between the domains, assuming that the conditional probabilities are similar. However in many real world applications, conditional probability distribution differences are as commonplace as marginal probability differences. In this paper we propose a two-stage domain adaptation methodology which combines weighted data from multiple sources based on marginal probability differences (first stage) as well as conditional probability differences (second stage), with the target domain data. The weights for minimizing the marginal probability differences are estimated independently, while the weights for minimizing conditional probability differences are computed simultaneously by exploiting the potential interaction among multiple sources. We also provide a theoretical analysis on the generalization performance of the proposed multi-source domain adaptation formulation using the weighted Rademacher complexity measure. 
Empirical comparisons with existing state-of-the-art domain adaptation methods using three real-world datasets demonstrate the effectiveness of the proposed approach.", "full_text": "A Two-Stage Weighting Framework for Multi-Source\n\nDomain Adaptation\n\nQian Sun\u2217, Rita Chattopadhyay\u2217, Sethuraman Panchanathan, Jieping Ye\n\nComputer Science and Engineering, Arizona State University, AZ 85287\n{Qian Sun, rchattop, panch, Jieping.Ye}@asu.edu\n\nAbstract\n\nDiscriminative learning when training and test data belong to different distribu-\ntions is a challenging and complex task. Often times we have very few or no\nlabeled data from the test or target distribution but may have plenty of labeled\ndata from multiple related sources with different distributions. The difference in\ndistributions may be both in marginal and conditional probabilities. Most of the\nexisting domain adaptation work focuses on the marginal probability distribution\ndifference between the domains, assuming that the conditional probabilities are\nsimilar. However in many real world applications, conditional probability dis-\ntribution differences are as commonplace as marginal probability differences. In\nthis paper we propose a two-stage domain adaptation methodology which com-\nbines weighted data from multiple sources based on marginal probability differ-\nences (\ufb01rst stage) as well as conditional probability differences (second stage),\nwith the target domain data. The weights for minimizing the marginal probability\ndifferences are estimated independently, while the weights for minimizing condi-\ntional probability differences are computed simultaneously by exploiting the po-\ntential interaction among multiple sources. We also provide a theoretical analysis\non the generalization performance of the proposed multi-source domain adapta-\ntion formulation using the weighted Rademacher complexity measure. 
Empirical\ncomparisons with existing state-of-the-art domain adaptation methods using three\nreal-world datasets demonstrate the effectiveness of the proposed approach.\n\n1\n\nIntroduction\n\nWe consider the domain adaptation scenarios where we have very few or no labeled data from target\ndomain but a large amount of labeled data from multiple related source domains with different data\ndistributions. Under such situations, learning a single or multiple hypotheses on the source domains\nusing traditional machine learning methodologies and applying them on target domain data may lead\nto poor prediction performance. This is because traditional machine learning algorithms assume that\nboth the source and target domain data are drawn i.i.d. from the same distribution. Figure 1 shows\ntwo such source distributions, along with their hypotheses obtained based on traditional machine\nlearning methodologies and a target data distribution. It is evident that the hypotheses learned by\nthe two source distributions D1 and D2 would perform poorly on the target domain data.\nOne effective approach under such situations is domain adaptation, which enables transfer of knowl-\nedge between the source and target domains with dissimilar distributions [1]. 
It has been applied\nsuccessfully in various applications including text classi\ufb01cation (parts of speech tagging, webpage\ntagging, etc) [2], video concept detection across different TV channels [3], sentiment analysis (iden-\ntifying positive and negative reviews across domains) [4] and WiFi Localization (locating device\nlocation depending upon the signal strengths from various access points) [5].\n\n\u2217Authors contributed equally.\n\n1\n\n\fFigure 1: Two source domains D1 and D2 and target domain data with different marginal and\nconditional probability differences, along with con\ufb02icting conditional probabilities (the red squares\nand blue triangles refer to the positive and negative classes).\nMany existing methods re-weight source domain data in order to minimize the marginal probability\ndifferences between the source and target domains and learn a hypothesis on the re-weighted source\ndata [6, 7, 8, 9]. However they assume that the distributions differ only in marginal probabilities but\nthe conditional probabilities remain the same. There are other methods that learn model parameters\nto reduce marginal probability differences [10, 11]. Similarly, several algorithms have been devel-\noped in the past to combine knowledge from multiple sources [12, 13, 14]. Most of these methods\nmeasure the distribution difference between each source and target domain data, independently,\nbased on marginal or conditional probability differences and combine the hypotheses generated by\neach of them on the basis of the respective similarity factors. However the example in Figure 1\ndemonstrates the importance of considering both marginal and conditional probability differences\nin multi-source domain adaptation.\nIn this paper we propose a two-stage multi-source domain adaptation framework which computes\nweights for the data samples from multiple sources to reduce both marginal and conditional prob-\nability differences between the source and target domains. 
In the \ufb01rst stage, we compute weights\nof the source domain data samples to reduce the marginal probability differences, using Maximum\nMean Discrepancy (MMD) [15, 6] as the measure. The second stage computes the weights of\nmultiple sources to reduce the conditional probability differences; the computation is based on the\nsmoothness assumption on the conditional probability distribution of the target domain data [16].\nFinally, a target classi\ufb01er is learned on the re-weighted source domain data. A novel feature of our\nweighting methodologies is that no labeled data is needed from the target domain, thus widening the\nscope of their applicability. The proposed framework is readily extendable to the case where a few\nlabeled data may be available from the target domain.\nIn addition, we present a detailed theoretical analysis on the generalization performance of our\nproposed framework. The error bound of the proposed target classi\ufb01er is based on the weighted\nRademacher complexity measure of a class of functions or hypotheses, de\ufb01ned over a weighted\nsample space [17, 18]. The Rademacher complexity measures the ability of a class of functions to\n\ufb01t noise. The empirical Rademacher complexity is data-dependent and can be measured from \ufb01nite\nsamples. It can lead to tighter bounds than those based on other complexity measures such as the\nVC-dimension. Theoretical analysis of domain adaptation has been studied in [19, 20]. In [19], the\nauthors provided the generalization bound based on the VC dimension for both single-source and\nmulti-source domain adaptation. The results were extended in [20] to a broader range of prediction\nproblems based on the Rademacher complexity; however only the single-source case was analyzed\nin [20]. 
We extend the analysis in [19, 20] to provide the generalization bound for our proposed\ntwo-stage framework based on the weighted Rademacher complexity; our generalization bound is\ntighter than the previous ones in the multi-source case. Our theoretical analysis also reveals the\nkey properties of our generalization bound in terms of a differential weight \u00b5 between the weighted\nsource and target samples.\nWe have performed extensive experiments using three real-world datasets including 20 Newsgroups,\nSentiment Analysis data and one dataset of multi-dimensional feature vectors extracted from Surface\nElectromyogram (SEMG) signals from eight subjects. SEMG signals are recorded using surface\nelectrodes, from the muscle of a subject, during a submaximal repetitive gripping activity, to detect\nstages of fatigue. Our empirical results demonstrate superior performance of the proposed approach\nover the existing state-of-the-art domain adaptation methods; our results also reveal the effect of the\ndifferential weight \u00b5 on the target classi\ufb01er performance.\n\n[Figure 1 panels: (a) D1: Source Domain 1; (b) D2: Source Domain 2; (c) Target Domain.]\n\n2 Proposed Approach\nWe consider the following multi-source domain adaptation setting. There are k auxiliary source\ndomains. Each source domain is associated with a sample set $D^s = \\{(x_i^s, y_i^s)\\}_{i=1}^{n_s}$, s = 1, 2, ..., k,\nwhere $x_i^s$ is the i-th feature vector, $y_i^s$ is the corresponding class label, $n_s$ is the sample size of the\ns-th source domain, and k is the total number of source domains. The target domain consists of\nplenty of unlabeled data $D_u^T = \\{x_i^T\\}_{i=1}^{n_u}$ and optionally a few labeled data $D_l^T = \\{(x_i^T, y_i^T)\\}_{i=1}^{n_l}$. Here\n$n_u$ and $n_l$ are the numbers of unlabeled and labeled data, respectively. Denote $D^T = D_l^T \\cup D_u^T$ and\n$n_T = n_l + n_u$. 
The goal is to build a classi\ufb01er for the target domain data using the source domain\ndata and a few labeled target domain data, if available.\nThe proposed approach consists of two stages. In the \ufb01rst stage, we compute the weights of source\ndomain data based on the marginal probability difference; in the second stage, we compute the\nweights of source domains based on the conditional probability difference. A target domain classi\ufb01er\nis learned on these re-weighted data.\n\n2.1 Re-weighting data samples based on marginal probability differences\n\nThe difference between the means of two distributions after mapping onto a reproducing kernel\nHilbert space, called Maximum Mean Discrepancy, has been shown to be an effective measure of\nthe differences in their marginal probability distributions [15]. We use this measure to compute the\nweights $\\alpha_i^s$ of the s-th source domain data by solving the following optimization problem [6]:\n\n$$\\min_{\\alpha^s} \\left\\| \\frac{1}{n_s} \\sum_{i=1}^{n_s} \\alpha_i^s \\Phi(x_i^s) - \\frac{1}{n_T} \\sum_{i=1}^{n_T} \\Phi(x_i^T) \\right\\|_{H}^{2} \\quad \\text{s.t. } \\alpha_i^s \\ge 0 \\qquad (1)$$\n\nwhere \u03a6(x) is a feature map onto a reproducing kernel Hilbert space H [21], ns is the number of\nsamples in the s-th source domain, nT is the number of samples in the target domain, and \u03b1s is the\nns dimensional weight vector. The minimization problem is a standard quadratic problem and can\nbe solved by applying many existing solvers.\n\n2.2 Re-weighting Sources based on Conditional probability differences\n\nIn the second stage the proposed framework modulates the \u03b1s weights of a source domain s obtained\non the basis of marginal probability differences in the \ufb01rst stage, with another weighting factor given\nby \u03b2s. 
The weight \u03b2s re\ufb02ects the similarity of a particular source domain s to the target domain\nwith respect to conditional probability distributions.\nNext, we show how to estimate the weights \u03b2s. For each of the k source domains, a hypothesis\nhs : X \u2192 Y is learned on the \u03b1s re-weighted source data samples. This ensures that the hypothesis\nis learned on source data samples with similar marginal probability distributions. These k source\ndomain hypotheses are used to predict the unlabeled target domain data $D_u^T = \\{x_i^T\\}_{i=1}^{n_u}$. Let $H_i^S = [h_i^1 \\cdots h_i^k]$\nbe the 1 \u00d7 k vector of predicted labels of k source domain hypotheses for the i-th sample\nof target domain data. Let $\\beta = [\\beta_1 \\cdots \\beta_k]'$ be the k \u00d7 1 weight vector, where \u03b2s is the weight\ncorresponding to the s-th source hypothesis. The estimation of the weight for each source domain\nhypothesis hs is based on the smoothness assumption on the conditional probability distribution\nof the target domain data [16]; speci\ufb01cally we aim to \ufb01nd the optimal weights by minimizing the\ndifference in predicted labels between two nearby points in the target domain as follows.\n\n$$\\min_{\\beta:\\, \\beta' e = 1,\\, \\beta \\ge 0} \\; \\sum_{i,j=1}^{n_u} \\left( H_i^S \\beta - H_j^S \\beta \\right)^2 W_{ij} \\qquad (2)$$\n\nwhere e denotes the vector of all ones, $H^S$ is an $n_u \\times k$ matrix with each row of $H^S$ given by $H_i^S$ as de\ufb01ned above, $H_i^S \\beta$ and $H_j^S \\beta$\nare the predicted labels for the i-th and j-th samples of target domain data obtained by following\na \u03b2 weighted ensemble methodology over all k sources, and $W_{ij}$ is the similarity between the two\ntarget domain data samples. 
We can rewrite the minimization problem as follows:\n\n$$\\min_{\\beta:\\, \\beta' e = 1,\\, \\beta \\ge 0} \\; \\beta' (H^S)' L_u H^S \\beta \\qquad (3)$$\n\nwhere $L_u$ is the graph Laplacian associated with the target domain data $D_u^T$, given by $L_u = D - W$,\nwhere W is the similarity matrix de\ufb01ning edge weights between the data samples in $D_u^T$, and D\nis the diagonal matrix given by $D_{ii} = \\sum_{j=1}^{n_u} W_{ij}$. The minimization problem in (3) is a standard\nquadratic problem (QP) and can be solved ef\ufb01ciently by applying many existing solvers.\nTo illustrate the proposed two-stage framework, we demonstrate the effect of re-weighting data\nsamples in source domains D1 and D2 of the toy dataset (shown in Figure 1), based on the computed\nweights, in the supplemental material.\n\n2.3 Learning the Target Classi\ufb01er\n\nThe target classi\ufb01er is learned based on the re-weighted source data and a few labeled target domain\ndata (if available). We also incorporate an additional weighting factor \u00b5 to provide a differential\nweight to the source domain data with respect to the labeled target domain data. Mathematically,\nthe target classi\ufb01er $\\hat{h}$ is learnt by solving the following optimization problem:\n\n$$\\hat{h} = \\arg\\min_{h} \\; \\mu \\sum_{s=1}^{k} \\frac{\\beta_s}{n_s} \\sum_{i=1}^{n_s} \\alpha_i^s L(h(x_i^s), y_i^s) + \\frac{1}{n_l} \\sum_{j=1}^{n_l} L(h(x_j^T), y_j^T) \\qquad (4)$$\n\nwhere nl is the number of labeled data from the target domain.\nWe refer to the proposed framework as 2-Stage Weighting framework for Multi-Source Domain\nAdaptation (2SW-MDA). Algorithm 1 below summarizes the main steps involved in 2SW-MDA.\n\nAlgorithm 1 2SW-MDA\n1: for s = 1, . . . 
, k do\n2: Compute \u03b1s by solving (1)\n3: Learn a hypothesis hs on the \u03b1s weighted source data\n4: end for\n5: Form the nu \u00d7 k prediction matrix H S as in Section 2.2\n6: Compute matrices W , D and L using the unlabeled target data $D_u^T$\n7: Compute \u03b2s by solving (3)\n8: Learn the target classi\ufb01er $\\hat{h}$ by solving (4)\n\n3 Theoretical Analysis\nFor convenience of presentation, we rewrite the empirical joint error function on (\u03b1, \u03b2)-weighted\nsource domain and the target domain de\ufb01ned in (4) as follows:\n\n$$\\hat{E}^S_{\\alpha,\\beta}(h) = \\mu \\hat{\\epsilon}_{\\alpha,\\beta}(h) + \\hat{\\epsilon}_T(h) = \\mu \\sum_{s=1}^{k} \\frac{\\beta_s}{n_s} \\sum_{i=1}^{n_s} \\alpha_i^s L(h(x_i^s), f_s(x_i^s)) + \\frac{1}{n_l} \\sum_{i=1}^{n_l} L(h(x_i^0), f_0(x_i^0)) \\qquad (5)$$\n\nwhere $y_i^s = f_s(x_i^s)$ and $f_s$ is the labeling function for source s, \u00b5 > 0, $(x_i^0)$ are samples from the\ntarget, $y_i^T = f_0(x_i^0)$ and $f_0$ is the labeling function for the target domain, and $S = (x_i^s)$ includes all\nsamples from the target and source domains. The true (\u03b1, \u03b2)-weighted error $\\epsilon_{\\alpha,\\beta}(h)$ on weighted\nsource domain samples is de\ufb01ned analogously. Similarly, we de\ufb01ne $E^S_{\\alpha,\\beta}(h)$ as the true joint error\nfunction. For notational simplicity, denote $n_0 = n_l$ as the number of labeled samples from the target,\n$m = \\sum_{s=0}^{k} n_s$ as the total number of samples from both source and target, and $\\gamma_i^s = \\mu \\beta_s \\alpha_i^s / n_s$ for\ns \u2265 1 and $\\gamma_i^s = 1/n_0$ for s = 0. 
Then we can re-write the empirical joint error function in (5) as:\n\n$$\\hat{E}^S_{\\alpha,\\beta}(h) = \\sum_{s=0}^{k} \\sum_{i=1}^{n_s} \\gamma_i^s L(h(x_i^s), f_s(x_i^s)).$$\n\nNext, we bound the difference between the true joint error function $E^S_{\\alpha,\\beta}(h)$ and its empirical estimate\n$\\hat{E}^S_{\\alpha,\\beta}(h)$ using the weighted Rademacher complexity measure [17, 18] de\ufb01ned as follows:\n\nDe\ufb01nition 1. (Weighted Rademacher Complexity) Let H be a set of real-valued functions de\ufb01ned\nover a set X. Given a sample $S \\in X^m$, the empirical weighted Rademacher complexity of H is\nde\ufb01ned as follows:\n\n$$\\hat{\\Re}_S(H) = E_{\\sigma} \\left[ \\sup_{h \\in H} \\left| \\sum_{s=0}^{k} \\sum_{i=1}^{n_s} \\gamma_i^s \\sigma_i^s h(x_i^s) \\right| \\;\\middle|\\; S = (x_i^s) \\right].$$\n\nThe expectation is taken over $\\sigma = \\{\\sigma_i^s\\}$ where $\\{\\sigma_i^s\\}$ are independent uniform random variables\ntaking values in {\u22121, +1}. The weighted Rademacher complexity of a hypothesis set H is de\ufb01ned\nas the expectation of $\\hat{\\Re}_S(H)$ over all samples of size m:\n\n$$\\Re_m(H) = E_S \\left[ \\hat{\\Re}_S(H) \\;\\middle|\\; |S| = m \\right].$$\n\nOur main result is summarized in the following lemma, which involves the estimation of the\nRademacher complexity of the following class of functions:\n\n$$G = \\{ x \\mapsto L(h'(x), h(x)) : h, h' \\in H \\}.$$\n\nLemma 1. Let H be a family of functions taking values in {\u22121, +1}. 
Then, for any \u03b4 > 0, with\nprobability at least 1 \u2212 \u03b4, the following holds for h \u2208 H:\n\n$$\\left| E^S_{\\alpha,\\beta}(h) - \\hat{E}^S_{\\alpha,\\beta}(h) \\right| \\le \\Re_S(H) + \\sqrt{\\frac{\\left( \\sum_{s=0}^{k} \\sum_{i=1}^{n_s} (\\gamma_i^s)^2 \\right) \\log(2/\\delta)}{2}}.$$\n\nFurthermore, if H has a VC dimension of d, then the following holds with probability at least 1 \u2212 \u03b4:\n\n$$\\left| E^S_{\\alpha,\\beta}(h) - \\hat{E}^S_{\\alpha,\\beta}(h) \\right| \\le \\sqrt{\\frac{\\left( \\sum_{s=0}^{k} \\sum_{i=1}^{n_s} (\\gamma_i^s)^2 \\right) \\log(2/\\delta)}{2}} \\left( \\sqrt{2 d \\log \\frac{em}{d}} + 1 \\right),$$\n\nwhere e is the base of the natural logarithm.\n\nThe proof is provided in Section A of the supplemental material.\n\n3.1 Error bound on target domain data\n\nIn the previous section we presented an upper bound on the difference between the true joint error\nfunction and its empirical estimate and established its relation to the weighting factors $\\gamma_i^s$. Next we\npresent our main theoretical result, an upper bound on the target domain error $\\epsilon_T(\\hat{h})$.\nWe need the following de\ufb01nition of divergence for our main result:\nDe\ufb01nition 2. For a hypothesis space H, the symmetric difference hypothesis space H\u2206H is the set\nof hypotheses\n\n$$g \\in H \\Delta H \\iff g(x) = h(x) \\oplus h'(x) \\text{ for some } h, h' \\in H,$$\n\nwhere \u2295 is the XOR function. 
In other words, every hypothesis g \u2208 H\u2206H is the set of disagreements\nbetween two hypotheses in H.\nThe H\u2206H-divergence between any two distributions DS and DT is de\ufb01ned as\n\n$$d_{H \\Delta H}(D_S, D_T) = 2 \\sup_{h, h' \\in H} \\left| \\Pr_{x \\sim D_S}[h(x) \\ne h'(x)] - \\Pr_{x \\sim D_T}[h(x) \\ne h'(x)] \\right|.$$\n\nTheorem 1. Let $\\hat{h} \\in H$ be an empirical minimizer of the joint error function on similarity weighted\nsource domain and the target domain:\n\n$$\\hat{h} = \\arg\\min_{h \\in H} \\hat{E}_{\\alpha,\\beta}(h) \\equiv \\mu \\hat{\\epsilon}_{\\alpha,\\beta}(h) + \\hat{\\epsilon}_T(h)$$\n\nfor \ufb01xed weights \u00b5, \u03b1, and \u03b2 and let $h_T^* = \\arg\\min_{h \\in H} \\epsilon_T(h)$ be a target error minimizer. Then for any\n\u03b4 \u2208 (0, 1), the following holds with probability at least 1 \u2212 \u03b4:\n\n$$\\epsilon_T(\\hat{h}) \\le \\epsilon_T(h_T^*) + \\frac{2 \\Re_S(H)}{1 + \\mu} + \\frac{2}{1 + \\mu} \\sqrt{\\frac{\\left( \\sum_{s=0}^{k} \\sum_{i=1}^{n_s} (\\gamma_i^s)^2 \\right) \\log(2/\\delta)}{2}} + \\frac{\\mu}{1 + \\mu} \\left( 2 \\lambda_{\\alpha,\\beta} + d_{H \\Delta H}(D_{\\alpha,\\beta}, D_T) \\right). \\qquad (6)$$\n\nFurthermore, if H has a VC dimension of d, then the following holds with probability at least 1 \u2212 \u03b4:\n\n$$\\epsilon_T(\\hat{h}) \\le \\epsilon_T(h_T^*) + \\frac{2}{1 + \\mu} \\left( \\sqrt{\\frac{\\left( \\sum_{s=0}^{k} \\sum_{i=1}^{n_s} (\\gamma_i^s)^2 \\right) \\log(2/\\delta)}{2}} \\left( \\sqrt{2 d \\log \\frac{em}{d}} + 1 \\right) \\right) + \\frac{\\mu}{1 + \\mu} \\left( 2 \\lambda_{\\alpha,\\beta} + d_{H \\Delta H}(D_{\\alpha,\\beta}, D_T) \\right), \\qquad (7)$$\n\nwhere $\\lambda_{\\alpha,\\beta} = \\min_{h \\in H} \\{ \\epsilon_T(h) + \\epsilon_{\\alpha,\\beta}(h) \\}$, and $d_{H \\Delta H}(D_{\\alpha,\\beta}, D_T)$ is the H\u2206H-divergence\nbetween the (\u03b1, \u03b2)-weighted source and target domain data.\n\nThe proof as well as a comparison with the result in [19] is provided in 
the supplemental material.\nWe observe that \u00b5 and the divergence between the weighted source and target data play signi\ufb01cant\nroles in the generalization bound. Our proposed two-stage weighting scheme aims to reduce the\ndivergence. Next, we analyze the effect of \u00b5. When \u00b5 = 0, the bound reduces to the generalization\nbound using the nl training samples in the target domain only. As \u00b5 increases, the effect of the source\ndomain data increases. Speci\ufb01cally, when \u00b5 is larger than a certain value, for the bound in (7), as \u00b5\nincreases, the second term will reduce, while the last term capturing the divergence will increase. In\nthe extreme case when \u00b5 = \u221e, the second term in (7) can be shown to be the generalization bound\nusing the weighted samples in the source domain only (the target data will not be effective in this\ncase), and the last term equals to 2\u03bb\u03b1,\u03b2 + dH\u2206H (D\u03b1,\u03b2, DT ). Thus, effective transfer is possible in\nthis case only if the divergence is small. We also observed in our experiments that the target domain\nerror of the learned joint hypothesis follows a bell shaped curve; it has a different optimal point for\neach dataset under certain similarity and divergence measures.\n\n4 Empirical evaluations\n\nDatasets. We evaluate the proposed 2SW-MDA method on three real-world datasets and the\ntoy data shown in Figure 1. The toy dataset is generated using a mixture of Gaussian distribu-\ntions. It has two classes and three domains, as shown in Figure 1. The two source domains D1\nand D2 were created to have both conditional and marginal probability differences with the target\ndomain data so as to provide an ideal testbed for the proposed domain adaptation methodology.\nThe three real-world datasets used are 20 Newsgroups1, Sentiment Analysis2 and another dataset\nof multi-dimensional feature vectors extracted from SEMG (Surface electromyogram) signals. 
The\n20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned\n(nearly) evenly across 20 different categories. We represented each document as a binary vector\nof the 100 most discriminating words determined by Weka\u2019s info-gain \ufb01lter [22]. Out of the 20\ncategories, we used 13 categories to form the source and target domains. For each of these cat-\negories the negative class was formed by a random mixture of the remaining seven categories, as\nsuggested in [23]. The details of the 13 categories used can be found in the supplemental material.\nThe Sentiment Analysis dataset contains positive and negative reviews on four categories (or do-\nmains) including kitchen, book, dvd, and electronics. We processed the Sentiment Analysis dataset\nto reduce the feature dimension to 200 using a cutoff document frequency of 50.\nThe SEMG dataset consists of 12-dimensional time and frequency domain features derived from Surface\nElectromyogram (SEMG) physiological signals. SEMG signals are biosignals recorded from the muscle\nof a subject using surface electrodes to study the musculoskeletal activities of the subject under test.\nThe SEMG signals used in our experiments are recorded from the extensor carpi radialis muscle during a\nsubmaximal repetitive gripping activity, to study different stages of fatigue. Data is collected from\n8 subjects. Each subject\u2019s data forms a domain. There are 4 classes de\ufb01ning various stages of fatigue.\nData from a target subject is classi\ufb01ed using the data from the remaining 7 subjects, which form the\nmultiple source domains.\nCompeting Methods. To evaluate the effectiveness of our approach we compare 2SW-MDA with a\nbaseline method SVM-C as well as with \ufb01ve state-of-the-art domain adaptation methods. 
In SVM-C,\nthe training data comprises data from all source domains (12 for 20 Newsgroups data) and the test\ndata is from the remaining one domain as indicated in the \ufb01rst column of the results in Table 1. The\nrecently proposed multi-source domain adaptation methods used for comparison include Locally\nWeighted Ensemble (LWE) [14] and Domain Adaptation Machine (DAM) [13]. To evaluate the\neffectiveness of multi-source domain adaptation, we also compared with three other state-of-the-art\nsingle-source domain adaptation methods, including Kernel Mean Matching (KMM) [6], Transfer\nComponent Analysis (TCA) [11] and Kernel Ensemble (KE) [24].\n\n1Available at http://www.ai.mit.edu/~jrennie/20Newsgroups/\n2Available at http://www.cs.jhu.edu/~mdredze/\n\nExperimental Setup. Recall that one of the appealing features of the proposed method is that\nit requires very few or no labeled target domain data. In our experiments, we used only 1 labeled\nsample per class from the target domain. The results of the proposed 2SW-MDA method are based\non \u00b5 = 1 (see Figure 2 for results on varying \u00b5). Each experiment was repeated 10 times with\nrandom selections of the labeled data. For each experiment, the category shown in the \ufb01rst column of\nTable 1 was used as the target domain and the rest of the categories as the source domains. Different\ninstances of the 20 Newsgroups categories are different random samples of 100 data samples se-\nlected from the total 500 data samples in the dataset. Different instances of the SEMG dataset are data\nbelonging to different subjects used as target data. 
Details about the parameter settings are included\nin the supplemental material.\n\nTCA\n\nDAM 2SW-MDA\n\n73.49%\n65.06%\n62.65%\n63.67%\n60.87%\n68.12%\n62.92%\n60.67%\n64.04%\n79.78%\n60.22%\n61.24%\n70.55%\n59.44%\n59.47%\n51.11%\n83.03%\n87.96%\n88.96%\n88.49%\n86.14%\n87.10%\n87.08%\n93.01%\n98.54%\n\nDataset\n\ntalk.politics.mideast\n\ntalk.politics.misc\n\ncomp.sys.ibm.pc.hardware\n\nrec.sport.baseball\n\nkitchen\n\nelectronics\n\nbook\ndvd\n\nSEMG- 8 subjects\n\nToy data\n\nKE\n\nLWE\n\nKMM\n\nSVM-C\n46.00% 50.66% 49.01% 45.78% 58.66% 52.03%\n49.33% 49.39% 53.48% 39.75% 56.00% 52.00%\n49.33% 50.27% 54.67% 43.37% 52.04% 51.81%\n48.83% 53.62% 46.77% 62.32% 55.90% 53.22%\n48.22% 51.12% 48.39% 59.42% 53.23% 54.12%\n48.31% 50.72% 55.01% 59.07% 54.83% 54.12%\n48.42% 51.25% 49.50% 50.56% 61.25% 52.50%\n47.44% 51.44% 49.44% 59.55% 57.50% 52.50%\n45.93% 49.88% 48.00% 58.43% 59.75% 57.80%\n56.25% 61.51% 47.50% 61.79% 61.75% 61.25%\n58.75% 50.09% 51.25% 64.04% 57.75% 53.75%\n56.35% 59.26% 56.25% 58.43% 57.83% 55.05%\n35.55% 40.12% 49.38% 64.04% 64.10% 58.61%\n35.95% 42.66% 48.38% 65.55% 54.20% 52.61%\n37.77% 40.12% 49.38% 58.88% 55.01% 54.10%\n36.01% 49.44% 48.77% 50.00% 50.00% 50.61%\n70.76% 67.44% 63.55% 64.94% 66.35% 74.83%\n43.69% 77.54% 74.62% 63.63% 59.94% 81.36%\n50.11% 75.55% 62.50% 64.06% 56.78% 74.77%\n59.65% 81.22% 69.35% 52.68% 73.38% 80.63%\n40.37% 52.48% 65.61% 49.77% 57.48% 76.74%\n59.21% 65.77% 83.92% 70.62% 76.92% 59.21%\n47.13% 60.32% 77.97% 51.13% 55.64% 74.27%\n69.85% 72.81% 79.48% 67.24% 42.79% 84.55%\n60.05% 75.63% 81.40% 68.01% 64.97% 84.27%\n\nTable 1: Comparison of different methods on three real-world and one toy datasets in terms of\nclassi\ufb01cation accuracies (%).\n\nComparative Studies. Table 1 shows the classi\ufb01cation accuracies of different methods on the real-\nworld and the toy datasets. We observe that SVM-C performs poorly for all cases. 
This may be\nattributed to the distribution difference among the multiple source and target domains. We observe\nthat 20 Newsgroups and Sentiment Analysis datasets have predominantly marginal probability dif-\nferences. In other words, the frequency of a particular word varies from one category of documents\nto another. In contrast, physiological signals such as SEMG are predominantly different in con-\nditional probability distributions due to the high subject based variability in the power spectrum\nof these signals and their variations as fatigue sets in [25, 26]. We also observe that the proposed\n2SW-MDA method outperforms other domain adaptation methods and achieves higher classi\ufb01cation\naccuracies in most cases, especially for the SEMG dataset. The accuracies of an SVM classi\ufb01er, on\nthe toy dataset, when learned only on the source domains D1, D2 individually and on the combined\nsource domains, are 64.08%, 71.84% and 60.05% respectively, while 2SW-MDA achieves an\naccuracy of 98.54%. More results are provided in the supplemental material.\nIt is interesting to note that instance re-weighting method KMM and feature mapping based method\nTCA, which address marginal probability differences between the source and target domains,\nperform better than LWE and KE for both 20 Newsgroups and Sentiment Analysis data. They also\nperform better than DAM, a multi-source domain adaptation method, based on marginal probability\nbased weighted hypotheses combination. It is worthwhile to note that LWE is based on conditional\nprobability differences and KE tries to address both differences. Thus, it is not surprising that LWE\nand KE perform better than KMM and TCA for the SEMG dataset, which is predominantly differ-\nent in conditional probability distributions. DAM too performs better for SEMG signals. 
However\nthe proposed 2SW-MDA method, which addresses both marginal and conditional probability differ-\nences outperforms all the other methods in most cases. Our experiments verify the effectiveness of\nthe proposed two-stage framework.\nParameter Sensitivity Studies. In this exper-\niment, we study the effect of \u00b5 on the classi\ufb01ca-\ntion performance. Figure 2 shows the variation\nin classi\ufb01cation accuracies for some cases pre-\nsented in Table 1, with varying \u00b5 over a range\n[0 0.001 0.01 0.1 0.3 0.5 1 100 1000]. The x-\naxis of the \ufb01gures are in logarithmic scale. The\nresults for the toy data are included in supple-\nmental material. We can observe from the \ufb01g-\nure that in most cases, the accuracy values in-\ncrease as \u00b5 increases from 0 to an optimal value\nand decreases when \u00b5 further increases. When\n\u00b5 = 0 the target classi\ufb01er is learned only on the\nfew labeled data from the target domain. As \u00b5\nincreases the transfer of knowledge due to the\npresence of additional weighted source data has\na positive impact leading to increase in classi\ufb01-\ncation accuracies in the target domain. We also\nobserve that after a certain value of \u00b5 the classi\ufb01er accuracies drop, due to the distribution differ-\nences between the source and target domains. These experimental results are consistent with the\ntheoretical results established in this paper.\n\nFigure 2: Performance of the proposed 2SW-\nMDA method on 20 Newsgroups dataset and Sen-\ntiment Analysis dataset with varying \u00b5.\n\n5 Conclusion\nDomain adaptation is an important problem that arises in a variety of modern applications where\nlimited or no labeled data is available for a target application. We presented here a novel multi-\nsource domain adaptation framework. 
The proposed framework computes the weights for the source\ndomain data using a two-step procedure in order to reduce both marginal and conditional probability\ndistribution differences between the source and target domains. We also presented a theoretical error\nbound on the target classifier learned on re-weighted data samples from multiple sources. Empirical\ncomparisons with existing state-of-the-art domain adaptation methods demonstrate the effectiveness\nof the proposed approach. As part of future work, we plan to extend the proposed multi-source\nframework to applications involving other types of physiological signals for developing generalized\nmodels across subjects for emotion and health monitoring [27, 28]. We would also like to extend\nour framework to video- and speech-based applications, which are commonly affected by distribution\ndifferences [3].\n\nAcknowledgements\n\nThis research is sponsored by NSF IIS-0953662, CCF-1025177, and ONR N00014-11-1-0108.\n\nReferences\n[1] S.J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2009.\n[2] H. Daum\u00e9 III. Frustratingly easy domain adaptation. In ACL, 2007.\n[3] L. Duan, I.W. Tsang, D. Xu, and S.J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.\n[4] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, 2007.\n[5] S.J. Pan, J.T. Kwok, and Q. Yang. Transfer learning via dimensionality reduction. In AAAI, 2008.\n[6] J. Huang, A.J. Smola, A. Gretton, K.M. Borgwardt, and B. Sch\u00f6lkopf. Correcting sample selection bias by unlabeled data. In NIPS, volume 19, page 601, 2007.\n[7] H. 
Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. In JSPI, 2000.\n[8] S. Bickel, M. Br\u00fcckner, and T. Scheffer. Discriminative learning under covariate shift. In JMLR, 2009.\n[9] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In NIPS, 2010.\n[10] M. Sugiyama, S. Nakajima, H. Kashima, P.V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2008.\n[11] S.J. Pan, I.W. Tsang, J.T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. In IJCAI, 2009.\n[12] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2009.\n[13] L. Duan, I.W. Tsang, D. Xu, and T. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, pages 289\u2013296, 2009.\n[14] J. Gao, W. Fan, J. Jiang, and J. Han. Knowledge transfer via multiple model local structure mapping. In KDD, pages 283\u2013291, 2008.\n[15] K.M. Borgwardt, A. Gretton, M.J. Rasch, H.P. Kriegel, B. Sch\u00f6lkopf, and A.J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In Bioinformatics, volume 22, pages 49\u201357, 2006.\n[16] R. Chattopadhyay, J. Ye, S. Panchanathan, W. Fan, and I. Davidson. Multi-source domain adaptation and its application to early detection of fatigue. In KDD, 2011.\n[17] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463\u2013482, 2002.\n[18] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902\u20131914, 2001.\n[19] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J.W. Vaughan. A theory of learning from different domains. Machine Learning, 79:151\u2013175, 2010.\n[20] Y. Mansour, M. Mohri, and A. 
Rostamizadeh. Domain adaptation: Learning bounds and algorithms. Computing Research Repository, abs/0902.3430, 2009.\n[21] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. In JMLR, volume 2, page 93, 2002.\n[22] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco, CA, 2000.\n[23] E. Eaton and M. desJardins. Set-based boosting for instance-level transfer. In IEEE International Conference on Data Mining Workshops, 2009.\n[24] E. Zhong, W. Fan, J. Peng, K. Zhang, J. Ren, D. Turaga, and O. Verscheure. Cross domain distribution adaptation via kernel mapping. In KDD, Paris, France, 2009. ACM.\n[25] P. Contessa, A. Adam, and C.J. De Luca. Motor unit control and force fluctuation during fatigue. Journal of Applied Physiology, April 2009.\n[26] B. Gerdle, B. Larsson, and S. Karlsson. Criterion validation of surface EMG variables as fatigue indicators using peak torque: a study of repetitive maximum isokinetic knee extensions. Journal of Electromyography and Kinesiology, 10(4):225\u2013232, August 2000.\n[27] E. Leon, G. Clarke, V. Callaghan, and F. Sepulveda. A user independent real time emotion recognition system for software agents in domestic environment. In Engineering Applications of Artificial Intelligence, April 2007.\n[28] J. Kim and E. Andr\u00e9. Emotion recognition based on physiological changes in music listening. In Pattern Analysis and Machine Intelligence, December 2008.\n[29] C. McDiarmid. On the method of bounded differences, volume 5. Cambridge University Press, Cambridge, 1989.\n[30] S. Kakade and A. Tewari. Lecture notes of CMSC 35900: Learning theory, Toyota Technological Institute at Chicago. Spring 2008.\n[31] P. Massart. Some applications of concentration inequalities to statistics. 
Annales de la Facult\u00e9 des Sciences de Toulouse, IX(2):245\u2013303, 2000.", "award": [], "sourceid": 370, "authors": [{"given_name": "Qian", "family_name": "Sun", "institution": null}, {"given_name": "Rita", "family_name": "Chattopadhyay", "institution": null}, {"given_name": "Sethuraman", "family_name": "Panchanathan", "institution": null}, {"given_name": "Jieping", "family_name": "Ye", "institution": null}]}