{"title": "Domain Adaptation with Multiple Sources", "book": "Advances in Neural Information Processing Systems", "page_first": 1041, "page_last": 1048, "abstract": "This paper presents a theoretical analysis of the problem of adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most \\epsilon are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most \\epsilon with respect to *any* target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3\\epsilon. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.", "full_text": "Domain Adaptation with Multiple Sources\n\nYishay Mansour\n\nGoogle Research and\n\nTel Aviv Univ.\n\nMehryar Mohri\n\nCourant Institute and\n\nGoogle Research\n\nAfshin Rostamizadeh\n\nCourant Institute\n\nNew York University\n\nmansour@tau.ac.il\n\nmohri@cims.nyu.edu\n\nrostami@cs.nyu.edu\n\nAbstract\n\nThis paper presents a theoretical analysis of the problem of domain adaptation\nwith multiple sources. For each source domain, the distribution over the input\npoints as well as a hypothesis with error at most \u01eb are given. The problem con-\nsists of combining these hypotheses to derive a hypothesis with small error with\nrespect to the target domain. 
We present several theoretical results relating to\nthis problem. In particular, we prove that standard convex combinations of the\nsource hypotheses may in fact perform very poorly and that, instead, combinations\nweighted by the source distributions bene\ufb01t from favorable theoretical guarantees.\nOur main result shows that, remarkably, for any \ufb01xed target function, there exists\na distribution weighted combining rule that has a loss of at most \u01eb with respect to\nany target mixture of the source distributions. We further generalize the setting\nfrom a single target function to multiple consistent target functions and show the\nexistence of a combining rule with error at most 3\u01eb. Finally, we report empirical\nresults for a multiple source adaptation problem with a real-world dataset.\n\n1 Introduction\n\nA common assumption in theoretical models of learning such as the standard PAC model [16], as\nwell as in the design of learning algorithms, is that training instances are drawn according to the\nsame distribution as the unseen test examples. In practice, however, there are many cases where this\nassumption does not hold. There can be no hope for generalization, of course, when the training and\ntest distributions vastly differ, but when they are less dissimilar, learning can be more successful.\n\nA typical situation is that of domain adaptation where little or no labeled data is at one\u2019s disposal\nfor the target domain, but large amounts of labeled data from a source domain somewhat similar to\nthe target, or hypotheses derived from that source, are available instead. 
This problem arises in a\nvariety of applications in natural language processing [4, 7, 10], speech processing [8, 9, 11, 13\u201315],\ncomputer vision [12], and many other areas.\n\nThis paper studies the problem of domain adaptation with multiple sources, which has also received\nconsiderable attention in many areas such as natural language processing and speech processing.\nAn example is the problem of sentiment analysis, which consists of classifying a text sample such\nas a movie review, a restaurant rating, a discussion board post, or another web page. Information about a\nrelatively small number of domains such as movies or books may be available, but little or none can\nbe found for more difficult domains such as travel.\n\nWe will consider the following problem of multiple source adaptation. For each source i \u2208 [1, k],\nthe learner receives the distribution Di of the input points corresponding to that source as well\nas a hypothesis hi with loss at most \u01eb on that source. The learner\u2019s task consists of combining\nthe k hypotheses hi, i \u2208 [1, k], to derive a hypothesis h with small loss with respect to the target\ndistribution. The target distribution is assumed to be a mixture of the distributions Di. We will\ndiscuss both the case where the mixture is known to the learner and the case where it is unknown.\n\nNote that the distribution Di is defined over the input points and bears no information about the\nlabels. In practice, Di is estimated from large amounts of unlabeled points typically available from\nsource i.\n\nAn alternative set-up for domain adaptation with multiple sources is one where the learner is not\nsupplied with a good hypothesis hi for each source but where instead he has access to the labeled\ntraining data for each source domain. 
A natural solution consists then of combining the raw labeled\ndata from each source domain to form a new sample more representative of the target distribution\nand using that to train a learning algorithm. This set-up and the type of solutions just described\nhave in fact been explored extensively in applications [8, 9, 11, 13\u201315]. However, several empirical\nobservations motivated our study of hypothesis combination, in addition to the theoretical simplicity\nand clarity of this framework.\n\nFirst, in some applications such as very large-vocabulary speech recognition, the original raw\ndata used to derive each domain-dependent model is often no longer available [2, 9]. This is because such\nmodels are typically obtained as a result of training based on many hours of speech with files\noccupying hundreds of gigabytes of disk space, while the models derived require orders of magnitude\nless space. Thus, combining raw labeled data sets is not possible in such cases. Secondly, a\ncombined data set can be substantially larger than each domain-specific data set, which can significantly\nincrease the computational cost of training and make it prohibitive for some algorithms. Thirdly,\ncombining labeled data sets requires the mixture parameters of the target distribution to be known,\nbut it is not clear how to produce a hypothesis with a low error rate with respect to any mixture\ndistribution.\n\nFew theoretical studies have been devoted to the problem of adaptation with multiple sources.\nBen-David et al. [1] gave bounds for single source adaptation, then Blitzer et al. [3] extended the work\nto give a bound on the error rate of a hypothesis derived from a weighted combination of the source\ndata sets for the specific case of empirical risk minimization. Crammer et al. 
[5, 6] also addressed\na problem where multiple sources are present but the nature of the problem differs from adaptation\nsince the distribution of the input points is the same for all these sources and only the labels change\ndue to varying amounts of noise. We are not aware of a prior theoretical study of the problem of\nadaptation with multiple sources analyzed here.\n\nWe present several theoretical results relating to this problem. We examine two types of hypothesis\ncombination. The first type is simply based on convex combinations of the k hypotheses hi. We\nshow that this natural and widely used hypothesis combination may in fact perform very poorly in\nour setting. Namely, we give a simple example of two distributions and two matching hypotheses,\neach with zero error for their respective distribution, but such that any convex combination has\nexpected absolute loss of 1/2 for the equal mixture of the distributions. This points out a potentially\nsignificant weakness of a convex combination.\n\nThe second type of hypothesis combination, which is the main one we will study in this work,\ntakes into account the probabilities derived from the distributions. Namely, the weight of hypothesis\nhi on an input x is proportional to \u03bbiDi(x), where \u03bb is the vector of mixture weights. We will refer\nto this method as the distribution weighted hypothesis combination. Our main result shows that,\nremarkably, for any fixed target function, there exists a distribution weighted combining rule that\nhas a loss of at most \u01eb with respect to any mixture of the k distributions. We also show that there\nexists a distribution weighted combining rule that has loss at most 3\u01eb with respect to any consistent\ntarget function (one for which each hi has loss at most \u01eb on Di) and any mixture of the k distributions. 
In\nsome sense, our results establish that the distribution weighted hypothesis combination is the \u201cright\u201d\ncombination rule, and that it also benefits from a well-founded theoretical guarantee.\n\nThe remainder of this paper is organized as follows. Section 2 introduces our theoretical model for\nmultiple source adaptation. In Section 3, we analyze the abstract case where the mixture parameters\nof the target distribution are known and show that the distribution weighted hypothesis combination\nthat uses as weights these mixture coefficients achieves a loss of at most \u01eb. In Section 4, we give\na simple method to produce an error of \u0398(k\u01eb) that does not require the prior knowledge of the\nmixture parameters of the target distribution. Our main results showing the existence of a combined\nhypothesis performing well regardless of the target mixture are given in Section 5 for the case of a\nfixed target function, and in Section 6 for the case of multiple target functions. Section 7 reports\nempirical results for a multiple source adaptation problem with a real-world dataset.\n\n2 Problem Set-Up\n\nLet X be the input space, f : X \u2192 R the target function to learn, and L : R \u00d7 R \u2192 R a loss function\npenalizing errors with respect to f. The loss of a hypothesis h with respect to a distribution D and\nloss function L is denoted by L(D, h, f) and defined as\n\n$$L(D, h, f) = \\mathrm{E}_{x \\sim D}[L(h(x), f(x))] = \\sum_{x \\in X} L(h(x), f(x)) D(x).$$\n\nWe will denote by \u2206 the simplex $\\Delta = \\{\\lambda : \\lambda_i \\geq 0 \\wedge \\sum_{i=1}^{k} \\lambda_i = 1\\}$ of $\\mathbb{R}^k$.\n\nWe consider an adaptation problem with k source domains and a single target domain. The input\nto the problem is the set of k source distributions D1, . . . , Dk and k corresponding hypotheses\nh1, . . . , hk such that for all i \u2208 [1, k], L(Di, hi, f) \u2264 \u01eb, for a fixed \u01eb \u2265 0. 
The distribution\nof the target domain, DT, is assumed to be a mixture of the k source distributions Di, that is\n$D_T(x) = \\sum_{i=1}^{k} \\lambda_i D_i(x)$, for some unknown mixture weight vector \u03bb \u2208 \u2206. The adaptation problem\nconsists of combining the hypotheses hi to derive a hypothesis with small loss on the target domain.\nSince the target distribution DT is assumed to be a mixture, we will refer to this problem as the\nmixture adaptation problem.\n\nA combining rule for the hypotheses takes as input the hypotheses hi and outputs a single hypothesis\nh : X \u2192 R. We define two combining rules of particular interest for our purpose: the linear\ncombining rule, which is based on a parameter z \u2208 \u2206 and which sets the hypothesis to\n$h(x) = \\sum_{i=1}^{k} z_i h_i(x)$; and the distribution weighted combining rule, also based on a parameter\nz \u2208 \u2206, which sets the hypothesis to\n\n$$h(x) = \\sum_{i=1}^{k} \\frac{z_i D_i(x)}{\\sum_{j=1}^{k} z_j D_j(x)} h_i(x) \\quad \\text{when} \\quad \\sum_{j=1}^{k} z_j D_j(x) > 0.$$\n\nThis last condition always holds if Di(x) > 0 for all x \u2208 X and some i \u2208 [1, k]. We define H to\nbe the set of all distribution weighted combining rules. Given the input to the adaptation problem\nwe have implicit information about the target function f. We define the set of consistent target\nfunctions, F, as follows,\n\nF = {g : \u2200i \u2208 [1, k], L(Di, hi, g) \u2264 \u01eb} .\n\nBy definition, the target function f is an element of F.\n\nWe will assume that the following properties hold for the loss function L: (i) L is non-negative:\nL(x, y) \u2265 0 for all x, y \u2208 R; (ii) L is convex with respect to the first argument:\n$L(\\sum_{i=1}^{k} \\lambda_i x_i, y) \\leq \\sum_{i=1}^{k} \\lambda_i L(x_i, y)$ for all x1, . . . , xk, y \u2208 R and \u03bb \u2208 \u2206; (iii) L is bounded: there exists M \u2265 0\nsuch that L(x, y) \u2264 M for all x, y \u2208 R; (iv) L(x, y) is continuous in both x and y; and (v) L is\nsymmetric: L(x, y) = L(y, x). 
The absolute loss defined by L(x, y) = |x \u2212 y| will serve as our\nprimary motivating example.\n\n3 Known Target Mixture Distribution\n\nIn this section we assume that the parameters of the target mixture distribution are known. Thus, the\nlearning algorithm is given \u03bb \u2208 \u2206 such that $D_T(x) = \\sum_{i=1}^{k} \\lambda_i D_i(x)$. A good starting point would be\nto study the performance of a linear combining rule, namely the classifier $h(x) = \\sum_{i=1}^{k} \\lambda_i h_i(x)$.\nWhile this seems like a very natural classifier, the following example highlights the problematic\naspects of this approach.\n\nConsider a discrete domain X = {a, b} and two distributions, Da and Db, such that Da(a) = 1\nand Db(b) = 1. Namely, each distribution puts all the weight on a single element in X. Consider\nthe target function f, where f(a) = 1 and f(b) = 0, and let the loss be the absolute loss. Let\nh0 = 0 be the function that outputs 0 for all x \u2208 X and similarly h1 = 1. The hypotheses h1\nand h0 have zero expected absolute loss on the distributions Da and Db, respectively, i.e., \u01eb = 0.\nNow consider the target distribution DT with \u03bba = \u03bbb = 1/2, thus DT(a) = DT(b) = 1/2. The\nhypothesis h(x) = (1/2)h1(x) + (1/2)h0(x) always outputs 1/2, and has an absolute loss of 1/2.\nFurthermore, for any other parameter z of the linear combining rule, the expected absolute loss of\nh(x) = zh1(x) + (1 \u2212 z)h0(x) with respect to DT is exactly 1/2: this h outputs z on both points, so its\nexpected loss is (1/2)|z \u2212 1| + (1/2)|z \u2212 0| = (1/2)(1 \u2212 z) + (1/2)z = 1/2. We have established the following\ntheorem.\nTheorem 1. There is a mixture adaptation problem with \u01eb = 0 for which any linear combination\nrule has expected absolute loss of 1/2.\n\nNext we show that the distribution weighted combining rule produces a hypothesis with a low\nexpected loss. Given a mixture $D_T(x) = \\sum_{i=1}^{k} \\lambda_i D_i(x)$, we consider the distribution weighted\ncombining rule with parameter \u03bb, which we denote by h\u03bb. 
Recall that,\n\n$$h_\\lambda(x) = \\sum_{i=1}^{k} \\frac{\\lambda_i D_i(x)}{\\sum_{j=1}^{k} \\lambda_j D_j(x)} h_i(x) = \\sum_{i=1}^{k} \\frac{\\lambda_i D_i(x)}{D_T(x)} h_i(x).$$\n\nUsing the convexity of L with respect to the first argument, the loss of h\u03bb with respect to DT and a\ntarget f \u2208 F can be bounded as follows,\n\n$$L(D_T, h_\\lambda, f) = \\sum_{x \\in X} L(h_\\lambda(x), f(x)) D_T(x) \\leq \\sum_{x \\in X} \\sum_{i=1}^{k} \\lambda_i D_i(x) L(h_i(x), f(x)) = \\sum_{i=1}^{k} \\lambda_i \\epsilon_i \\leq \\epsilon,$$\n\nwhere $\\epsilon_i := L(D_i, h_i, f) \\leq \\epsilon$. Thus, we have derived the following theorem.\nTheorem 2. For any mixture adaptation problem with target distribution $D_\\lambda(x) = \\sum_{i=1}^{k} \\lambda_i D_i(x)$,\nthe expected loss of the hypothesis h\u03bb is at most \u01eb with respect to any target function f \u2208 F:\nL(D\u03bb, h\u03bb, f) \u2264 \u01eb.\n\n4 Simple Adaptation Algorithms\n\nIn this section we show how to construct a simple distribution weighted hypothesis that has an\nexpected loss guarantee with respect to any mixture. Our hypothesis hu is simply based on equal\nweights, i.e., ui = 1/k, for all i \u2208 [1, k]. Thus,\n\n$$h_u(x) = \\sum_{i=1}^{k} \\frac{(1/k) D_i(x)}{\\sum_{j=1}^{k} (1/k) D_j(x)} h_i(x) = \\sum_{i=1}^{k} \\frac{D_i(x)}{\\sum_{j=1}^{k} D_j(x)} h_i(x).$$\n\nWe show for hu an expected loss bound of k\u01eb, with respect to any mixture distribution DT and target\nfunction f \u2208 F. (Proof omitted.)\nTheorem 3. For any mixture adaptation problem the expected loss of hu is at most k\u01eb, for any\nmixture distribution DT and target function f \u2208 F, i.e., L(DT, hu, f) \u2264 k\u01eb.\n\nUnfortunately, the hypothesis hu can have an expected absolute loss as large as \u2126(k\u01eb). (Proof\nomitted.)\nTheorem 4. There is a mixture adaptation problem for which the expected absolute loss of hu is\n\u2126(k\u01eb). Also, for k = 2 there is an input to the mixture adaptation problem for which the expected\nabsolute loss of hu is $2\\epsilon - \\epsilon^2$.\n\n5 Existence of a Good Hypothesis\n\nIn this section, we will show that for any target function f \u2208 F there is a distribution weighted\ncombining rule hz that has a loss of at most \u01eb with respect to any mixture DT. We will construct\nthe proof in two parts. In the first part, we will show, using a simple reduction to a zero-sum game,\nthat one can obtain a mixture of hypotheses hz that guarantees a loss bounded by \u01eb. In the second part, which\nis the more interesting scenario, we will show that for any target function f \u2208 F there is a single\ndistribution weighted combining rule hz that has loss of at most \u01eb with respect to any mixture DT.\nThis latter part will require the use of the Brouwer fixed point theorem to show the existence of such an\nhz.\n\n5.1 Zero-sum game\n\nThe adaptation problem can be viewed as a zero-sum game between two players, NATURE and\nLEARNER. Let the input to the mixture adaptation problem be D1, . . . , Dk, h1, . . . , hk and \u01eb, and\nfix a target function f \u2208 F. The player NATURE picks a distribution Di while the player LEARNER\nselects a distribution weighted combining rule hz \u2208 H. The loss when NATURE plays Di and\nLEARNER plays hz is L(Di, hz, f). Let us emphasize that the target function f \u2208 F is fixed\nbeforehand. The objective of NATURE is to maximize the loss and the objective of LEARNER is to\nminimize the loss. We start with the following lemma.\n\nLemma 1. Given any mixed strategy of NATURE, i.e., a distribution \u00b5 over the Di\u2019s, the\naction h\u00b5 \u2208 H of LEARNER has expected loss at most \u01eb, i.e., L(D\u00b5, h\u00b5, f) \u2264 \u01eb.\n\nThe proof is identical to that of Theorem 2. This almost establishes that the value of the game is at\nmost \u01eb. 
The technical part that we need to take care of is the fact that the action space of LEARNER\nis infinite. However, by an appropriate discretization of H we can derive the following theorem.\nTheorem 5. For any target function f \u2208 F and any \u03b4 > 0, there exists a function\n$h(x) = \\sum_{j=1}^{m} \\alpha_j h_{z_j}(x)$, where $h_{z_j} \\in H$, such that L(DT, h, f) \u2264 \u01eb + \u03b4 for any mixture distribution\n$D_T(x) = \\sum_{i=1}^{k} \\lambda_i D_i(x)$.\n\nSince we can fix \u03b4 > 0 to be arbitrarily small, this implies that a linear mixture of distribution\nweighted combining rules can guarantee a loss of almost \u01eb with respect to any mixture distribution.\n\n5.2 Single distribution weighted combining rule\n\nIn the previous subsection, we showed that a mixture of hypotheses in H would guarantee a loss of\nat most \u01eb + \u03b4 for any \u03b4 > 0. Here, we will considerably strengthen the result and show that there is a single hypothesis\nin H for which this guarantee holds. Unfortunately our loss is not convex with respect to h \u2208 H, so\nwe need to resort to a more powerful technique, namely the Brouwer fixed point theorem.\n\nFor the proof we will need that the distribution weighted combining rule hz be continuous in\nthe parameter z. In general, this does not hold due to the existence of points x \u2208 X for which\n$\\sum_{j=1}^{k} z_j D_j(x) = 0$. To avoid this discontinuity, we will modify the definition of hz to $h_z^\\eta$, as\nfollows.\nClaim 1. Let U denote the uniform distribution over X. For any \u03b7 > 0 and z \u2208 \u2206, let\n$h_z^\\eta : X \\to \\mathbb{R}$ be the function defined by\n\n$$h_z^\\eta(x) = \\sum_{i=1}^{k} \\frac{z_i D_i(x) + \\eta U(x)/k}{\\sum_{j=1}^{k} z_j D_j(x) + \\eta U(x)} h_i(x).$$\n\nThen, for any distribution D, $L(D, h_z^\\eta, f)$ is continuous in z.\u00b9\n\nLet us first state Brouwer\u2019s fixed point theorem.\nTheorem 6 (Brouwer Fixed Point Theorem). 
For any compact and convex non-empty set $A \\subset \\mathbb{R}^n$\nand any continuous function f : A \u2192 A, there is a point x \u2208 A such that f(x) = x.\n\nWe first show that there exists a distribution weighted combining rule $h_z^\\eta$ for which the losses\n$L(D_i, h_z^\\eta, f)$ are all nearly the same.\nLemma 2. For any target function f \u2208 F and any \u03b7, \u03b7\u2032 > 0, there exists z \u2208 \u2206, with $z_i \\neq 0$ for all\ni \u2208 [1, k], such that the following holds for the distribution weighted combining rule $h_z^\\eta \\in H$:\n\n$$L(D_i, h_z^\\eta, f) = \\gamma + \\eta' - \\frac{\\eta'}{z_i k} \\leq \\gamma + \\eta'$$\n\nfor any 1 \u2264 i \u2264 k, where $\\gamma = \\sum_{j=1}^{k} z_j L(D_j, h_z^\\eta, f)$.\n\nProof. Fix \u03b7\u2032 > 0 and let $L_i^z = L(D_i, h_z^\\eta, f)$ for all z \u2208 \u2206 and i \u2208 [1, k]. Consider the\nmapping \u03c6 : \u2206 \u2192 \u2206 defined for all z \u2208 \u2206 by $[\\phi(z)]_i = (z_i L_i^z + \\eta'/k) / (\\sum_{j=1}^{k} z_j L_j^z + \\eta')$,\nwhere $[\\phi(z)]_i$ is the ith coordinate of \u03c6(z), i \u2208 [1, k]. By Claim 1, \u03c6 is continuous. Thus,\nby Brouwer\u2019s Fixed Point Theorem, there exists z \u2208 \u2206 such that \u03c6(z) = z. This implies that\n$z_i = (z_i L_i^z + \\eta'/k) / (\\sum_{j=1}^{k} z_j L_j^z + \\eta')$. Since \u03b7\u2032 > 0, we must have $z_i \\neq 0$ for any i \u2208 [1, k]. Thus,\nwe can divide by $z_i$ and write $L_i^z + \\eta'/(z_i k) = (\\sum_{j=1}^{k} z_j L_j^z) + \\eta'$. Therefore, $L_i^z = \\gamma + \\eta' - \\eta'/(z_i k)$\nwith $\\gamma = \\sum_{j=1}^{k} z_j L_j^z$.\n\n\u00b9In addition to continuity, the perturbation to hz, $h_z^\\eta$, also helps us ensure that none of the mixture weights\n$z_i$ is zero in the proof of Lemma 2.\n\nNote that the lemma just presented does not use the structure of the distribution weighted combining\nrule, but only the fact that the loss is continuous in the parameter z \u2208 \u2206. The lemma applies as well\nto the linear combination rule and provides the same guarantee. The real crux of the argument is, as\nshown in the next lemma, that \u03b3 is small for a distribution weighted combining rule (while it can be\nvery large for a linear combination rule).\nLemma 3. For any target function f \u2208 F and any \u03b7, \u03b7\u2032 > 0, there exists z \u2208 \u2206 such that\n$L(D_\\lambda, h_z^\\eta, f) \\leq \\epsilon + \\eta M + \\eta'$ for any \u03bb \u2208 \u2206.\n\nProof. Let z be the parameter guaranteed in Lemma 2. Then $L(D_i, h_z^\\eta, f) = \\gamma + \\eta' - \\eta'/(z_i k) \\leq \\gamma + \\eta'$,\nfor 1 \u2264 i \u2264 k. Consider the mixture Dz, i.e., set the mixture parameter to be z. Consider the\nquantity $L(D_z, h_z^\\eta, f)$. On the one hand, by definition, $L(D_z, h_z^\\eta, f) = \\sum_{i=1}^{k} z_i L(D_i, h_z^\\eta, f)$ and\nthus $L(D_z, h_z^\\eta, f) = \\gamma$. On the other hand,\n\n$$\\begin{aligned} L(D_z, h_z^\\eta, f) &= \\sum_{x \\in X} D_z(x) L(h_z^\\eta(x), f(x)) \\\\ &\\leq \\sum_{x \\in X} \\frac{D_z(x)}{D_z(x) + \\eta U(x)} \\left( \\sum_{i=1}^{k} \\left( z_i D_i(x) + \\frac{\\eta U(x)}{k} \\right) L(h_i(x), f(x)) \\right) \\\\ &\\leq \\sum_{x \\in X} \\left( \\sum_{i=1}^{k} z_i D_i(x) L(h_i(x), f(x)) \\right) + \\sum_{x \\in X} \\eta M U(x) \\\\ &= \\sum_{i=1}^{k} z_i L(D_i, h_i, f) + \\eta M = \\sum_{i=1}^{k} z_i \\epsilon_i + \\eta M \\leq \\epsilon + \\eta M. \\end{aligned}$$\n\nTherefore \u03b3 \u2264 \u01eb + \u03b7M. To complete the proof, note that the following inequality holds for any\nmixture D\u03bb:\n\n$$L(D_\\lambda, h_z^\\eta, f) = \\sum_{i=1}^{k} \\lambda_i L(D_i, h_z^\\eta, f) \\leq \\gamma + \\eta',$$\n\nwhich is at most \u01eb + \u03b7M + \u03b7\u2032.\n\nBy setting \u03b7 = \u03b4/(2M) and \u03b7\u2032 = \u03b4/2, we can derive the following theorem.\nTheorem 7. 
For any target function f \u2208 F and any \u03b4 > 0, there exist \u03b7 > 0 and z \u2208 \u2206, such that\n$L(D_\\lambda, h_z^\\eta, f) \\leq \\epsilon + \\delta$ for any mixture parameter \u03bb.\n\n6 Arbitrary target function\n\nThe results of the previous section show that for any fixed target function there is a good distribution\nweighted combining rule. In this section, we wish to extend these results to the case where the target\nfunction is not fixed in advance. Thus, we seek a single distribution weighted combining rule that\ncan perform well for any f \u2208 F and any mixture D\u03bb. Unfortunately, we are not able to prove a\nbound of \u01eb + o(\u01eb) but only a bound of 3\u01eb. To show this bound we will show that for any f1, f2 \u2208 F\nand any hypothesis h the difference in loss is at most 2\u01eb.\nLemma 4. Assume that the loss function L obeys the triangle inequality, i.e., L(f, h) \u2264 L(f, g) +\nL(g, h). Then for any f, f\u2032 \u2208 F and any mixture DT, the inequality L(DT, h, f\u2032) \u2264 L(DT, h, f) +\n2\u01eb holds for any hypothesis h.\n\nProof. Since our loss function obeys the triangle inequality, for any functions f, g, h, the following\nholds, L(D, f, h) \u2264 L(D, f, g) + L(D, g, h). In our case, we observe that replacing g with any\nf\u2032 \u2208 F gives, L(D\u03bb, f, h) \u2264 L(D\u03bb, f\u2032, h) + L(D\u03bb, f, f\u2032). We can bound the term L(D\u03bb, f, f\u2032)\nwith a similar inequality, L(D\u03bb, f, f\u2032) \u2264 L(D\u03bb, f, h\u03bb) + L(D\u03bb, f\u2032, h\u03bb) \u2264 2\u01eb, where h\u03bb is the\ndistribution weighted combining rule produced by choosing z = \u03bb and using Theorem 2. Therefore,\nfor any f, f\u2032 \u2208 F we have, L(D\u03bb, f, h) \u2264 L(D\u03bb, f\u2032, h) + 2\u01eb, which completes the proof.\n\nWe derive the following corollary to Theorem 7.\nCorollary 1. Assume that the loss function L obeys the triangle inequality. 
Then, for any \u03b4 > 0,\nthere exist \u03b7 > 0 and z \u2208 \u2206, such that for any mixture parameter \u03bb and any f \u2208 F,\n$L(D_\\lambda, h_z^\\eta, f) \\leq 3\\epsilon + \\delta$.\n\n[Figure 1 plots: (a) Uniform Mixture Over 4 Domains, MSE for In-Domain and Out-Domain; (b) Mixture = \u03b1 book + (1 \u2212 \u03b1) kitchen and Mixture = \u03b1 dvd + (1 \u2212 \u03b1) electronics, MSE versus \u03b1 for the weighted and linear rules and the two base hypotheses.]\n\nFigure 1: (a) MSE performance for a target mixture of four domains (1: books, 2: dvd, 3: electronics,\n4: kitchen, 5: linear, 6: weighted). (b) MSE performance under various mixtures of two source\ndomains, plot left: book and kitchen, plot right: dvd and electronics.\n\n7 Empirical results\n\nThis section reports the results of our experiments with a distribution weighted combining rule using\nreal-world data. In our experiments, we fixed a mixture target distribution D\u03bb and considered the\ndistribution weighted combining rule hz, with z = \u03bb. Since we used real-world data, we did not have\naccess to the domain distributions. Instead, we modeled each distribution and used large amounts\nof unlabeled data available for each source to estimate the model\u2019s parameters. 
One could have thus\nexpected potentially significantly worse empirical results than the theoretical ones, but this turned\nout not to be an issue in our experiments.\n\nWe used the sentiment analysis dataset found in [4].\u00b2 The data consists of review text and\nrating labels, taken from amazon.com product reviews within four different categories (domains).\nThese four domains consist of book, dvd, electronics and kitchen reviews, where each\ndomain contains 2000 data points.\u00b3 In our experiments, we fixed a mixture target distribution D\u03bb and\nconsidered the distribution weighted combining rule hz, with z = \u03bb.\n\nIn our first experiment, we considered mixtures of all four domains, where the test set was a uniform\nmixture of 600 points, that is the union of 150 points taken uniformly at random from each domain.\nThe remaining 1,850 points from each domain were used to train the base hypotheses.\u2074 We\ncompared our proposed weighted combining rule to the linear combining rule. The results are shown\nin Figure 1(a). They show that the base hypotheses perform poorly on the mixture test set, which\njustifies the need for adaptation. Furthermore, the distribution weighted combining rule is shown to\nperform at least as well as the worst in-domain performance of a base hypothesis, as expected from\nour bounds. Finally, we observe that this real-world data experiment gives an example in which a\nlinear combining rule performs poorly compared to the distribution weighted combining rule.\n\nIn other experiments, we considered the mixture of two domains, where the mixture is varied\naccording to the parameter \u03b1 \u2208 {0.1, 0.2, . . . , 1.0}. For each plot in Figure 1(b), the test set consists\nof 600\u03b1 points from the first domain and 600(1 \u2212 \u03b1) points from the second domain, where the\nfirst and second domains are made clear in the figure. 
The remaining points that were not used for\ntesting were used to train the base hypotheses. The results show the linear shift from one domain to\nthe other, as is evident from the performance of the two base hypotheses. The distribution weighted\ncombining rule outperforms the base hypotheses as well as the linear combining rule.\n\n\u00b2http://www.seas.upenn.edu/~mdredze/datasets/sentiment/.\n\u00b3The rating label, an integer between 1 and 5, was used as a regression label, and the loss was measured by the\nmean squared error (MSE). All base hypotheses were generated using Support Vector Regression (SVR) [17]\nwith the trade-off parameters C = 8, \u01eb = 0.1, and a Gaussian kernel with parameter g = 0.00078. The SVR\nsolutions were obtained using the libSVM software library (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).\nOur features were defined as the set of unigrams appearing five times or more in all domains. This defined\nabout 4000 unigrams. We used a binary feature vector encoding the presence or absence of these frequent\nunigrams to define our instances. To model the domain distributions, we used a unigram statistical language\nmodel trained on the same corpus as the one used to define the features. The language model was created using\nthe GRM library (http://www.research.att.com/~fsmtools/grm/).\n\u2074Each experiment was repeated 20 times with random folds. The standard deviation found was far below\nwhat could be legibly displayed in the figures.\n\nThus, our preliminary experiments suggest that the distribution weighted combining rule performs\nwell in practice and clearly outperforms a simple linear combining rule. 
Furthermore, using statistical\nlanguage models as approximations to the distribution oracles seems to be sufficient in practice\nand can help produce a good distribution weighted combining rule.\n\n8 Conclusion\n\nWe presented a theoretical analysis of the problem of adaptation with multiple sources. Domain\nadaptation is an important problem that arises in a variety of modern applications where limited or\nno labeled data is available for a target application, and our analysis can be relevant in a variety of\nsituations. The theoretical guarantees proven for the distribution weighted combining rule provide it\nwith a strong foundation. Its empirical performance with a real-world data set further motivates\nits use in applications. Many of the results presented were based on the assumption that the target\ndistribution is some mixture of the source distributions. A further analysis suggests, however, that\nour main results can be extended to arbitrary target distributions.\n\nAcknowledgments\n\nWe thank Jennifer Wortman for helpful comments on an earlier draft of this paper and Ryan McDonald for\ndiscussions and pointers to data sets. The work of M. Mohri and A. Rostamizadeh was partly supported by the\nNew York State Office of Science, Technology and Academic Research (NYSTAR).\n\nReferences\n\n[1] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for\ndomain adaptation. In Proceedings of NIPS 2006. MIT Press, 2007.\n\n[2] Jacob Benesty, M. Mohan Sondhi, and Yiteng Huang, editors. Springer Handbook of Speech Processing.\nSpringer, 2008.\n\n[3] John Blitzer, Koby Crammer, A. Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for\ndomain adaptation. In Proceedings of NIPS 2007. MIT Press, 2008.\n\n[4] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, Boom-boxes and Blenders:\nDomain Adaptation for Sentiment Classification. 
In ACL 2007, Prague, Czech Republic, 2007.\n\n[5] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from Data of Variable Quality. In\nProceedings of NIPS 2005, 2006.\n\n[6] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. In Proceedings\nof NIPS 2006, 2007.\n\n[7] Mark Dredze, John Blitzer, Partha Pratim Talukdar, Kuzman Ganchev, Joao Graca, and Fernando Pereira.\nFrustratingly Hard Domain Adaptation for Parsing. In CoNLL 2007, Prague, Czech Republic, 2007.\n\n[8] Jean-Luc Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate Gaussian mixture\nobservations of Markov chains. IEEE Transactions on Speech and Audio Processing, 2(2):291\u2013298, 1994.\n\n[9] Frederick Jelinek. Statistical Methods for Speech Recognition. The MIT Press, 1998.\n\n[10] Jing Jiang and ChengXiang Zhai. Instance Weighting for Domain Adaptation in NLP. In Proceedings of\nACL 2007, pages 264\u2013271, Prague, Czech Republic, 2007. Association for Computational Linguistics.\n\n[11] C. J. Leggetter and Phil C. Woodland. Maximum likelihood linear regression for speaker adaptation of\ncontinuous density hidden Markov models. Computer Speech and Language, pages 171\u2013185, 1995.\n\n[12] Aleix M. Mart\u00ednez. Recognizing imprecisely localized, partially occluded, and expression variant faces\nfrom a single sample per class. IEEE Trans. Pattern Anal. Mach. Intell., 24(6):748\u2013763, 2002.\n\n[13] S. Della Pietra, V. Della Pietra, R. L. Mercer, and S. Roukos. Adaptive language modeling using minimum\ndiscriminant estimation. In HLT \u201991: Proceedings of the workshop on Speech and Natural Language,\npages 103\u2013106, Morristown, NJ, USA, 1992. Association for Computational Linguistics.\n\n[14] Brian Roark and Michiel Bacchiani. Supervised and unsupervised PCFG adaptation to novel domains. In\nProceedings of HLT-NAACL, 2003.\n\n[15] Roni Rosenfeld. 
A Maximum Entropy Approach to Adaptive Statistical Language Modeling. Computer\n\nSpeech and Language, 10:187\u2013228, 1996.\n\n[16] Leslie G. Valiant. A theory of the learnable. ACM Press New York, NY, USA, 1984.\n[17] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.\n\n8\n\n\f", "award": [], "sourceid": 198, "authors": [{"given_name": "Yishay", "family_name": "Mansour", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}, {"given_name": "Afshin", "family_name": "Rostamizadeh", "institution": null}]}