{"title": "Algorithms and Theory for Multiple-Source Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 8246, "page_last": 8256, "abstract": "We present a number of novel contributions to the multiple-source adaptation problem. We derive new normalized solutions with strong theoretical guarantees for the cross-entropy loss and other similar losses. We also provide new guarantees that hold in the case where the conditional probabilities for the source domains are distinct. Moreover, we give new algorithms for determining the distribution-weighted combination solution for the cross-entropy loss and other losses. We report the results of a series of experiments with real-world datasets. We find that our algorithm outperforms competing approaches by producing a single robust model that performs well on any target mixture distribution. Altogether, our theory, algorithms, and empirical results provide a full solution for the multiple-source adaptation problem with very practical benefits.", "full_text": "Algorithms and Theory for\nMultiple-Source Adaptation\n\nJudy Hoffman\n\nCS Department UC Berkeley\n\nBerkeley, CA 94720\n\njhoffman@eecs.berkeley.edu\n\nMehryar Mohri\n\nCourant Institute and Google\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nNingshan Zhang\n\nNew York University\nNew York, NY 10012\n\nnzhang@stern.nyu.edu\n\nAbstract\n\nWe present a number of novel contributions to the multiple-source adaptation\nproblem. We derive new normalized solutions with strong theoretical guarantees\nfor the cross-entropy loss and other similar losses. We also provide new guarantees\nthat hold in the case where the conditional probabilities for the source domains\nare distinct. Moreover, we give new algorithms for determining the distribution-\nweighted combination solution for the cross-entropy loss and other losses. We\nreport the results of a series of experiments with real-world datasets. 
We find that our algorithm outperforms competing approaches by producing a single robust model that performs well on any target mixture distribution. Altogether, our theory, algorithms, and empirical results provide a full solution for the multiple-source adaptation problem with very practical benefits.

1 Introduction

In many modern applications, the learner often has access to information about several source domains, including accurate predictors possibly trained and made available by others, but no direct information about a target domain for which one wishes to achieve a good performance. The target domain can typically be viewed as a combination of the source domains, that is, a mixture of their joint distributions, or it may be close to such mixtures. In addition, the learner often does not have access to all source data simultaneously, for legitimate reasons such as privacy or storage limitations. Thus, the learner cannot simply pool all source data together to learn a predictor.

Such problems arise commonly in speech recognition, where different groups of speakers (domains) yield different acoustic models and the problem is to derive an accurate acoustic model for a broader population that may be viewed as a mixture of the source groups (Liao, 2013). In object recognition, multiple image databases exist, each with its own bias and labeled categories (Torralba and Efros, 2011), but the target application may contain images which most closely resemble only a subset of the available training data.
Finally, in sentiment analysis, accurate predictors may be available for sub-domains such as TVs, laptops and CD players, each previously trained on labeled data, but no labeled data or predictor may be at the learner's disposal for the more general category of electronics, which can be modeled as a mixture of the sub-domains (Blitzer et al., 2007; Dredze et al., 2008).

The problem of transfer from a single source to a known target domain (Ben-David, Blitzer, Crammer, and Pereira, 2006; Mansour, Mohri, and Rostamizadeh, 2009b; Cortes and Mohri, 2014; Cortes, Mohri, and Muñoz Medina, 2015), either through unsupervised adaptation techniques (Gong et al., 2012; Long et al., 2015; Ganin and Lempitsky, 2015; Tzeng et al., 2015) or via lightly supervised ones (some amount of labeled data from the target domain) (Saenko et al., 2010; Yang et al., 2007; Hoffman et al., 2013; Girshick et al., 2014), has been extensively investigated in the past. Here, we focus on the problem of multiple-source domain adaptation and ask how the learner can combine relatively accurate predictors available for each source domain to derive an accurate predictor for any new mixture target domain. This is known as the multiple-source adaptation (MSA) problem, first formalized and analyzed theoretically by Mansour, Mohri, and Rostamizadeh (2008, 2009a) and later studied for various applications such as object recognition (Hoffman et al., 2012; Gong et al., 2013a,b). Recently, Zhang et al. (2015) studied a causal formulation of this problem for a classification scenario, using the same combination rules as Mansour et al. (2008, 2009a).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
A closely related problem to the MSA problem is that of domain generalization (Pan and Yang, 2010; Muandet et al., 2013; Xu et al., 2014), where knowledge from an arbitrary number of related domains is combined to perform well on a previously unseen domain. Appendix G includes a more detailed discussion of previous work related to the MSA problem.

Mansour, Mohri, and Rostamizadeh (2008, 2009a) gave strong theoretical guarantees for a distribution-weighted combination to address the MSA problem, but they did not provide an algorithmic solution to determine that combination. Furthermore, the solution they proposed could not be used for loss functions such as cross-entropy, which require a normalized predictor. Their work also assumed a deterministic (non-stochastic) scenario with the same labeling function for all source domains.

This work makes a number of novel contributions to the MSA problem. We give new normalized solutions with strong theoretical guarantees for the cross-entropy loss and other similar losses. Our guarantees hold even when the conditional probabilities for the source domains are distinct. A by-product of our analysis is the extension of the theoretical results of Mansour et al. (2008, 2009a) to the stochastic scenario, where there is a joint distribution over the input and output space.

Moreover, we give new algorithms for determining the distribution-weighted combination solution for the cross-entropy loss and other losses. We prove that the problem of determining that solution can be cast as a DC-programming (difference-of-convex) problem and derive explicit DC-decompositions for the cross-entropy loss and other losses. We also give a series of experimental results with several datasets demonstrating that our distribution-weighted combination solution is remarkably robust.
Our algorithm outperforms competing approaches and performs well on any target mixture distribution. Altogether, our theory, algorithms, and empirical results provide a full solution for the MSA problem with very practical benefits.

2 Problem setup

Let X denote the input space and Y the output space. We consider a multiple-source adaptation (MSA) problem in the general stochastic scenario, where there is a distribution over the joint input-output space X×Y. This is a more general setup than the deterministic scenario in (Mansour et al., 2008, 2009a), where a target function mapping from X to Y is assumed. This extension is needed for the analysis of the most common and realistic learning setups in practice. We will assume that X and Y are discrete, but the predictors we consider can take real values, and our theory can be straightforwardly extended to the continuous case, with summations replaced by integrals in the proofs. We will identify a domain with a distribution over X×Y and consider the scenario where the learner has access to a predictor h_k for each domain D_k, k ∈ [p] = {1, ..., p}.

We consider two types of predictor functions h_k, and their associated loss functions L, under the regression model (R) and the probability model (P) respectively:

(R)  h_k : X → R, with L : R × Y → R+;
(P)  h_k : X × Y → [0, 1], with L : [0, 1] → R+.

We abuse the notation and write L(h, x, y) to denote the loss of a predictor h at point (x, y), that is, L(h(x), y) in the regression model and L(h(x, y)) in the probability model.
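To make the two loss conventions concrete, here is a small illustrative sketch (not from the paper; the toy distribution and predictor are invented for illustration) of the squared loss, the cross-entropy loss, and an expected loss computed over a discrete joint distribution:

```python
import numpy as np

# Toy discrete joint distribution D over X x Y (|X| = 3, |Y| = 2); entries sum to 1.
D = np.array([[0.2, 0.1],
              [0.3, 0.1],
              [0.2, 0.1]])

# Regression model (R): h maps x to a real value; L(h(x), y) is the squared loss.
def squared_loss(hx, y):
    return (hx - y) ** 2

# Probability model (P): h maps (x, y) to [0, 1]; L(h(x, y)) is the cross-entropy loss.
def cross_entropy_loss(hxy):
    return -np.log(hxy)

# Expected loss L(D, h) = sum_{x,y} D(x, y) L(h, x, y), probability model.
def expected_loss(D, h):
    return sum(D[x, y] * cross_entropy_loss(h[x, y])
               for x in range(D.shape[0]) for y in range(D.shape[1]))

h = np.full((3, 2), 0.5)              # a maximally uncertain normalized predictor
print(round(expected_loss(D, h), 4))  # -log(0.5) ≈ 0.6931
```

Since the toy predictor assigns probability 0.5 everywhere, the expected cross-entropy loss reduces to log 2 regardless of D.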
We will denote by L(D, h) the expected loss of a predictor h with respect to the distribution D:

L(D, h) = E_{(x,y)∼D}[L(h, x, y)] = Σ_{(x,y)∈X×Y} D(x, y) L(h, x, y).

Much of our theory only assumes that L is convex and continuous. But we will be particularly interested in the case where, in the regression model, L(h(x), y) = (h(x) − y)² is the squared loss, and where, in the probability model, L(h(x, y)) = −log h(x, y) is the cross-entropy loss (log-loss).

We will assume that each h_k is a relatively accurate predictor for the distribution D_k: there exists ε > 0 such that L(D_k, h_k) ≤ ε for all k ∈ [p]. We will also assume that the loss of the source hypotheses h_k is bounded, that is, L(h_k, x, y) ≤ M for all (x, y) ∈ X×Y and all k ∈ [p].

In the MSA problem, the learner's objective is to combine these predictors to design a predictor with small expected loss on a target domain that could be an arbitrary and unknown mixture of the source domains, the case we are particularly interested in, or even some other arbitrary distribution. It is worth emphasizing that the learner has no knowledge of the target domain.

How do we combine the h_k's? Can we use a convex combination rule, Σ_{k=1}^p λ_k h_k, for some λ ∈ Δ? In Appendix A (Lemmas 9 and 10) we show that no convex combination rule will perform well even in very simple MSA problems. These results generalize a previous lower bound of Mansour et al. (2008). Next, we show that the distribution-weighted combination rule is a suitable solution. Extending the definition given by Mansour et al. (2008), we define the distribution-weighted combination of the functions h_k, k ∈ [p], as follows.
For any η > 0, z ∈ Δ, and (x, y) ∈ X×Y,

(R)  h^η_z(x) = Σ_{k=1}^p [ (z_k D¹_k(x) + η U¹(x)/p) / (Σ_{j=1}^p z_j D¹_j(x) + η U¹(x)) ] h_k(x),   (1)

(P)  h^η_z(x, y) = Σ_{k=1}^p [ (z_k D_k(x, y) + η U(x, y)/p) / (Σ_{j=1}^p z_j D_j(x, y) + η U(x, y)) ] h_k(x, y),   (2)

where we denote by D¹ the marginal distribution over X, that is, D¹(x) = Σ_{y∈Y} D(x, y) for all x ∈ X, and by U¹ the uniform distribution over X. This extension may seem technically straightforward in hindsight, but the form of the predictor was not immediately clear in the stochastic case.

3 Theoretical guarantees

In this section, we present a series of theoretical guarantees for distribution-weighted combinations with a suitable choice of the parameters z and η, both for the regression model and for the probability model. We first give our main result for the general stochastic scenario. Next, for the probability model with the cross-entropy loss, we introduce a normalized distribution-weighted combination and prove that it benefits from strong theoretical guarantees.

Our theoretical results rely on a measure of divergence between two distributions. The one that naturally comes up in our analysis is the Rényi divergence (Rényi, 1961). We will denote by d_α(D ∥ D′) = e^{D_α(D ∥ D′)} the exponential of the α-Rényi divergence of two distributions D and D′. See Appendix F for more details about the notion of Rényi divergence.

3.1 General guarantees for regression and probability models

Let D_T be an unknown target distribution. We will denote by D_T(·|x) and D_k(·|x) the conditional probability distributions on the target and the source domain k respectively. We do not assume that the target and source conditional probabilities D_T(·|x) and D_k(·|x) coincide for all k ∈ [p] and x ∈ X. This is a significant extension of the MSA scenario with respect to the one considered by Mansour et al.
(2009a), which assumed exactly the same labeling function f for all source domains, in the deterministic scenario.

Let D_T be a mixture of source distributions, such that D¹_T ∈ D¹ = {Σ_{k=1}^p λ_k D¹_k : λ ∈ Δ} in the regression model, or D_T ∈ D = {Σ_{k=1}^p λ_k D_k : λ ∈ Δ} in the probability model. We also assume that, under the regression model, all possible target distributions D_T admit the same (unknown) conditional probability distribution.

Fix α > 1 and define ε_T by

ε_T = max_{k∈[p]} [ E_{x∼D¹_k} [ d_α(D_T(·|x) ∥ D_k(·|x))^{α−1} ] ]^{1/α} ε^{(α−1)/α} M^{1/α}.

ε_T depends on the maximal expected Rényi divergence between the target conditional probability distribution D_T(·|x) and the source ones D_k(·|x), ∀k ∈ [p], with the expectation taken over the source marginal distribution D¹_k, and the maximum taken over k ∈ [p]. When the target conditional is close to all source ones, α can be chosen to be very large and ε_T is close to ε. In particular, when the conditional probabilities coincide, for α = +∞, we have ε_T = ε.

Theorem 1.
For any δ > 0, there exist η > 0 and z ∈ Δ such that the following inequalities hold for any α > 1 and any target distribution D_T that is a mixture of source distributions:

(R)  L(D_T, h^η_z) ≤ ε_T + δ,
(P)  L(D_T, h^η_z) ≤ ε + δ.

As discussed later, the proof of more general results (Theorem 2 and Theorem 14) is given in Appendix B. The learning guarantees for the regression and the probability model are slightly different, since the definitions of the distribution-weighted combinations are different for the two models. Theorem 1 shows the existence of η > 0 and a mixture weight z ∈ Δ with a remarkable property: in the regression model (R), for any target distribution D_T whose conditional D_T(·|x) is on average not too far away from D_k(·|x) for any k ∈ [p], and whose marginal satisfies D¹_T ∈ D¹, the loss of h^η_z on D_T is small. It is even more remarkable that, in the probability model (P), the loss of h^η_z is at most ε on any target distribution D_T ∈ D. Thus, h^η_z is a robust hypothesis with a favorable property for any such target distribution D_T.

We now present a more general result, Theorem 2, that relaxes the assumptions, under the regression model, that all possible target distributions D_T admit the same conditional probability distribution and that the target's marginal distribution is a mixture of the source ones. In Appendix B, we show that Theorem 2 coincides with Theorem 1 under those assumptions. In Appendix B, we further give a more general result than Theorem 1 under the probability model (Theorem 14).

To present this more general result, we first introduce some additional notation. Given a conditional
Given a conditional\n\nT \u2208D1, the loss of h\u2318\n\nprobability distribution Q(\u22c5\uffffx) de\ufb01ned for all x\u2208X , de\ufb01ne \u270f\u21b5(Q) as follows:\nk\uffffd\u21b5(Q(\u22c5\uffffx)\u2225 Dk(\u22c5\uffffx))\u21b5\u22121\uffff\uffff\nk\u2208[p]\uffff E\nx\u223cD1\nThus, \u270f\u21b5(Q) depends on the maximal expected \u21b5-R\u00e9nyi divergence between Q(\u22c5\uffffx) and Dk(\u22c5\uffffx),\nand \u270f\u21b5(Q)= \u270fT when Q(\u22c5\uffffx)= DT(\u22c5\uffffx). When there exists Q(\u22c5\uffffx) such that the expected \u21b5-R\u00e9nyi\ndivergence is small for all k\u2208[p], then \u270f\u21b5(Q) is close to \u270f for \u21b5=+\u221e. In addition, we will use the\nk(x)Q(y\uffffx) andDP,Q=\uffff\u2211p\nfollowing de\ufb01nitions: Dk,Q(x, y)= D1\nTheorem 2 (Regression model). Fix a conditional probability distribution Q(\u22c5\uffffx) de\ufb01ned for all\nx\u2208X . Then, for any > 0, there exist \u2318> 0 and z\u2208 such that the following inequality holds for\nany \u21b5, > 1 and any target distribution DT :\nz)\u2264\uffff\uffff\u270f\u21b5(Q)+ \uffff d(DT\u2225DP,Q)\uffff \u22121\nprobabilities of the source and target domains and a \ufb01xed pivot Q(\u22c5\uffffx). In particular, when there exists\na pivot Q(\u22c5\uffffx) that is close to DT(\u22c5\uffffx) and Dk(\u22c5\uffffx), for all k\u2208[p], then the guarantee is signi\ufb01cant.\nOne candidate for such a pivot is a conditional probability distribution Q(\u22c5\uffffx) minimizing \u270f\u21b5(Q).\n\nIn many learning tasks, it is reasonable to assume that the conditional probability of the output\nlabels is the same across source domains. For example, a dog picture represents a dog regardless\nof whether the picture belongs to an individual\u2019s personal collection or to a broader database of\npictures from multiple individuals. This is a straightforward extension of the assumption adopted by\nMansour et al. 
(2008) in the deterministic scenario, where exactly the same labeling function f is assumed for all source domains. In that case, we have D_T(·|x) = D_k(·|x), ∀k ∈ [p], and therefore d_α(D_T(·|x) ∥ D_k(·|x)) = 1. Setting α = +∞, we recover the main result of Mansour et al. (2008).

Corollary 3. Assume that the conditional probability distributions D_k(·|x) do not depend on k. Then, for any δ > 0, there exist η > 0 and z ∈ Δ such that L(D_λ, h^η_z) ≤ ε + δ for any mixture parameter λ ∈ Δ.

Corollary 3 shows the existence of a parameter η > 0 and a mixture weight z ∈ Δ with a remarkable property: for any δ > 0, regardless of which mixture weight λ ∈ Δ defines the target distribution, the loss of h^η_z is at most ε + δ, that is, arbitrarily close to ε. h^η_z is therefore a robust hypothesis with a favorable property for any mixture target distribution.

To cover the realistic cases in applications, we further extend this result to the case where the distributions D_k are not directly available to the learner, and instead estimates D̂_k have been derived from data, and further to the case where the target distribution D_T is not a mixture of source distributions. We will denote by ĥ^η_z the distribution-weighted combination rule based on the estimates D̂_k. Our learning guarantee for ĥ^η_z depends on the Rényi divergence of D̂_k and D_k, as well as the Rényi divergence of D_T and the family of mixtures of source distributions.

Corollary 4.
For any δ > 0, there exist η > 0 and z ∈ Δ such that the following inequality holds for any α > 1 and arbitrary target distribution D_T:

L(D_T, ĥ^η_z) ≤ [ (ε̂ + δ) d_α(D_T ∥ D̂) ]^{(α−1)/α} M^{1/α},

where ε̂ = max_{k∈[p]} [ ε d_α(D̂_k ∥ D_k) ]^{(α−1)/α} M^{1/α}, and D̂ = {Σ_{k=1}^p λ_k D̂_k : λ ∈ Δ}.

Corollary 4 shows that there exists a predictor ĥ^η_z based on the estimated distributions D̂_k that is ε̂-accurate with respect to any target distribution D_T whose Rényi divergence with respect to the family D̂ is not too large (d_α(D_T ∥ D̂) close to 1). Furthermore, ε̂ is close to ε, provided that the D̂_k are good estimates of the D_k (that is, d_α(D̂_k ∥ D_k) close to 1). The proof is given in Appendix B.

3.2 Guarantees for the probability model with the cross-entropy loss

Here, we discuss the important special case where L coincides with the cross-entropy loss in the probability model, and present a guarantee for a normalized distribution-weighted combination solution. This analysis is a complement to Theorem 1, which only holds for the unnormalized hypothesis h^η_z.

The cross-entropy loss assumes normalized hypotheses. Thus, here, we assume that the source functions are normalized for every x: Σ_{y∈Y} h_k(x, y) = 1, ∀x ∈ X, ∀k ∈ [p]. For any η > 0 and z ∈ Δ, we define a normalized weighted combination h̄^η_z(x, y) that is based on the distribution-weighted combination h^η_z(x, y) defined by (2):

h̄^η_z(x, y) = h^η_z(x, y) / Σ_{y′∈Y} h^η_z(x, y′).

We will first assume that the conditional probability distributions D_k(·|x) do not depend on k.

Theorem 5. Assume that there exists µ > 0 such that D_k(x, y) ≥ µ U(x, y) for all k ∈ [p] and (x, y) ∈ X×Y.
Then, for any δ > 0, there exist η > 0 and z ∈ Δ such that L(D_λ, h̄^η_z) ≤ ε + δ for any mixture parameter λ ∈ Δ.

Theorem 5 provides a strong guarantee that is the analogue of Corollary 3 for normalized distribution-weighted combinations. The theorem can also be extended to the case of arbitrary target distributions and estimated densities. When the conditional probabilities are distinct across the source domains, we propose a marginal distribution-weighted combination rule, which is already normalized. We can directly apply Theorem 1 to that solution and achieve favorable guarantees. More details are presented in Appendix C.

These results are non-trivial and important, as they provide a guarantee for an accurate and robust predictor for a commonly used loss function, the cross-entropy loss.

4 Algorithms

We have shown that, for both the regression and the probability model, there exists a vector z defining a distribution-weighted combination hypothesis h^η_z that admits very favorable guarantees. But how can we find such a z? This is a key question in the MSA problem, which was not addressed by Mansour et al. (2008, 2009a): no algorithm was previously reported to determine the mixture parameter z, even for the deterministic scenario. Here, we give an algorithm for determining that vector z.

In this section, we give practical and efficient algorithms for finding the vector z in the important cases of the squared loss in the regression model and the cross-entropy loss in the probability model, by leveraging the differentiability of the loss functions. We first show that z is the solution of a general optimization problem. Next, we give a DC-decomposition (difference-of-convex decomposition) of the objective for both models, thereby proving an explicit DC-programming formulation of the problem.
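As an illustration of the combination rules at the heart of these guarantees, the following sketch implements the distribution-weighted combination (2) and its normalized version h̄^η_z. All densities and predictors here are randomly generated stand-ins for illustration, not the paper's estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
p, nx, ny = 3, 4, 2                      # p source domains over a discrete X x Y (4 x 2)

# Toy source densities D_k and normalized source predictors h_k.
Dk = rng.random((p, nx, ny)); Dk /= Dk.sum(axis=(1, 2), keepdims=True)
hk = rng.random((p, nx, ny)); hk /= hk.sum(axis=2, keepdims=True)
U = np.full((nx, ny), 1.0 / (nx * ny))   # uniform distribution over X x Y

def h_eta_z(z, eta):
    """Distribution-weighted combination, probability model (eq. (2))."""
    num = sum((z[k] * Dk[k] + eta * U / p) * hk[k] for k in range(p))
    den = sum(z[k] * Dk[k] for k in range(p)) + eta * U
    return num / den

def h_bar(z, eta):
    """Normalized combination used with the cross-entropy loss (Section 3.2)."""
    h = h_eta_z(z, eta)
    return h / h.sum(axis=1, keepdims=True)

z = np.full(p, 1.0 / p)
print(h_bar(z, 0.1).sum(axis=1))   # each row sums to 1 after normalization
```

At every point (x, y), h^η_z is a convex combination of the source predictions, with point-dependent weights proportional to the smoothed densities z_k D_k + η U/p; the normalization step then makes it a valid conditional probability over Y.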
This leads to an efficient DC algorithm that is guaranteed to converge to a stationary point. Additionally, we show that it is straightforward to test whether the solution obtained is the global optimum. While we do not prove that the stationary point found by our algorithm is the global optimum, empirically, we observe that this is indeed the case.

4.1 Optimization problem

Theorem 1 shows that the hypothesis h^η_z based on the mixture parameter z benefits from a strong generalization guarantee. A key step in proving Theorem 1 is to show the following lemma.

Lemma 6. For any η, η′ > 0, there exists z ∈ Δ, with z_k ≠ 0 for all k ∈ [p], such that the following holds for the distribution-weighted combining rule h^η_z:

∀k ∈ [p],  L(D_k, h^η_z) ≤ Σ_{j=1}^p z_j L(D_j, h^η_z) + η′.   (3)

Lemma 6 indicates that, for the solution z, h^η_z has essentially the same loss on all source domains. Thus, our problem consists of finding a parameter z verifying this property. This, in turn, can be formulated as a min-max problem, min_{z∈Δ} max_{k∈[p]} L(D_k, h^η_z) − L(D_z, h^η_z), where D_z = Σ_{j=1}^p z_j D_j, which can be equivalently formulated as the following optimization problem:

min_{z∈Δ, γ∈R} γ   s.t.  L(D_k, h^η_z) − L(D_z, h^η_z) ≤ γ, ∀k ∈ [p].   (4)

4.2 DC-decomposition

We provide explicit DC-decompositions of the objective of Problem (4) for the regression model with the squared loss and for the probability model with the cross-entropy loss. The derivations are given in Appendix D. We first rewrite h^η_z as the ratio of two affine functions of z for both the regression (R) and the probability (P) model, h_z = J_z / K_z, where we adopt the following definitions and notation:

(R)  J_z(x) = Σ_{k=1}^p (z_k D¹_k(x) + η U¹(x)/p) h_k(x),   K_z(x) = D¹_z(x) + η U¹(x);
(P)  J_z(x, y) = Σ_{k=1}^p (z_k D_k(x, y) + η U(x, y)/p) h_k(x, y),   K_z(x, y) = D_z(x, y) + η U(x, y).

Proposition 7 (Regression model, squared loss). Let L be the squared loss.
Then, for any k ∈ [p], L(D_k, h^η_z) − L(D_z, h^η_z) = u_k(z) − v_k(z), where u_k and v_k are convex functions defined for all z by

u_k(z) = L(D_k + η U¹ D_k(·|x), h^η_z) − 2M Σ_x (D¹_k + η U¹)(x) log K_z(x),
v_k(z) = L(D_z + η U¹ D_k(·|x), h^η_z) − 2M Σ_x (D¹_k + η U¹)(x) log K_z(x).

Proposition 8 (Probability model, cross-entropy loss). Let L be the cross-entropy loss. Then, for any k ∈ [p], L(D_k, h^η_z) − L(D_z, h^η_z) = u_k(z) − v_k(z), where u_k and v_k are convex functions defined for all z by

u_k(z) = −Σ_{x,y} [D_k(x, y) + η U(x, y)] log J_z(x, y),
v_k(z) = Σ_{x,y} K_z(x, y) log [K_z(x, y)/J_z(x, y)] − [D_k(x, y) + η U(x, y)] log K_z(x, y).

4.3 DC algorithm

Our DC-decompositions prove that the optimization problem (4) can be cast as the following variational form of a DC-programming problem (Tao and An, 1997, 1998; Sriperumbudur and Lanckriet, 2012):

min_{z∈Δ, γ∈R} γ   s.t.  (u_k(z) − v_k(z) ≤ γ) ∧ (−z_k ≤ 0) ∧ (Σ_{k=1}^p z_k − 1 = 0), ∀k ∈ [p].   (5)

Figure 1: MSE for sentiment analysis under a mixture of two domains: (a) (left figure) dvd and electronics; (b) (right figure) kitchen and books.

The DC-programming algorithm works as follows.
Let (z_t)_t be the sequence defined by repeatedly solving the following convex optimization problem:

z_{t+1} ∈ argmin_{z∈Δ, γ∈R} γ   s.t.  (u_k(z) − v_k(z_t) − (z − z_t)·∇v_k(z_t) ≤ γ) ∧ (−z_k ≤ 0) ∧ (Σ_{k=1}^p z_k − 1 = 0), ∀k ∈ [p],   (6)

where z_0 ∈ Δ is an arbitrary starting value. Then, (z_t)_t is guaranteed to converge to a local minimum of Problem (4) (Yuille and Rangarajan, 2003; Sriperumbudur and Lanckriet, 2012). Note that Problem (6) is a relatively simple optimization problem: u_k(z) is a weighted sum of the negative logarithm of an affine function of z, plus a weighted sum of rational functions of z (squared loss), and all other terms appearing in the constraints are affine functions of z.

Problem (4) seeks a parameter z verifying L(D_k, h^η_z) − L(D_z, h^η_z) ≤ γ, for all k ∈ [p], for an arbitrarily small value of γ. Since L(D_z, h^η_z) = Σ_{k=1}^p z_k L(D_k, h^η_z) is a weighted average of the expected losses L(D_k, h^η_z), k ∈ [p], the solution γ cannot be negative. Furthermore, by Lemma 6, a parameter z verifying that inequality exists for any γ > 0. Thus, the global solution of Problem (4) must be close to zero. This provides us with a simple criterion for testing the global optimality of the solution z we obtain using a DC-programming algorithm with a starting parameter z_0.

5 Experiments

This section reports the results of our experiments with our DC-programming algorithm for finding a robust domain generalization solution when using the squared loss and the cross-entropy loss. We first evaluated our algorithm using an artificial dataset with known densities, where we could compare our result to the global solution, and found that our global objective indeed approached the known optimum of zero (see Appendix E for more details).
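For concreteness, here is a minimal sketch of a DCA loop in the spirit of Section 4.3. The convex functions u_k and v_k below are simple quadratic stand-ins (not the decompositions of Propositions 7 and 8), and a projected-subgradient step solves each linearized subproblem in place of an off-the-shelf convex solver:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
a = rng.random((p, p)); b = rng.random((p, p))   # toy data defining u_k and v_k

u = lambda k, z: np.sum((z - a[k]) ** 2)         # convex part
v = lambda k, z: np.sum((z - b[k]) ** 2)         # convex part that is subtracted
grad_v = lambda k, z: 2 * (z - b[k])

def project_simplex(w):
    """Euclidean projection onto the probability simplex."""
    s = np.sort(w)[::-1]
    css = np.cumsum(s) - 1
    rho = np.nonzero(s - css / (np.arange(len(w)) + 1) > 0)[0][-1]
    return np.maximum(w - css[rho] / (rho + 1), 0)

def dca(T=30, steps=200, lr=0.05):
    z = np.full(p, 1.0 / p)
    for _ in range(T):
        zt = z.copy()
        # Convex subproblem: min_z max_k u_k(z) - [v_k(zt) + grad_v_k(zt).(z - zt)]
        for _ in range(steps):
            vals = [u(k, z) - v(k, zt) - grad_v(k, zt) @ (z - zt) for k in range(p)]
            k_star = int(np.argmax(vals))
            g = 2 * (z - a[k_star]) - grad_v(k_star, zt)   # subgradient of the max
            z = project_simplex(z - lr * g)
    return z

z = dca()
print(z, z.sum())   # a point in the simplex
```

The outer loop linearizes only the subtracted convex part v_k at the current iterate, which is exactly what makes each subproblem convex; any convex solver could replace the inner projected-subgradient loop.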
Next, we evaluated our DC-programming solution applied to real-world datasets: a sentiment analysis dataset (Blitzer et al., 2007) with the squared loss, and, with the cross-entropy loss, a visual domain adaptation benchmark dataset, Office (Saenko et al., 2010), as well as a generalization of a digit recognition task.

For all real-world datasets, the probability distributions D_k are not readily available to the learner. However, Corollary 4 extends the learning guarantees of our solution to the case where an estimate D̂_k is used in lieu of the ideal distribution D_k. Thus, we used standard density estimation methods to derive an estimate D̂_k for each k ∈ [p]. While density estimation can be a difficult task in general, for our purpose, straightforward techniques were sufficient for our predictor ĥ^η_z to achieve a high performance, since the approximate densities only serve to indicate the relative importance of each source domain. We give full details about our density estimation procedure in Appendix E.

5.1 Sentiment analysis task with the squared loss

We used the sentiment analysis dataset proposed by Blitzer et al. (2007) and used for multiple-source adaptation by Mansour et al. (2008, 2009a). This dataset consists of product review text and rating labels taken from four domains: books (B), dvd (D), electronics (E), and kitchen (K), with 2,000 samples for each domain. We defined a vocabulary of 2,500 words that occur at least twice in the intersection of the four domains. These words were used to define feature vectors, where every sample was encoded by the number of occurrences of each word. We trained our base hypotheses
We trained our base hypotheses\nusing support vector regression with the same hyper-parameters as in (Mansour et al., 2008, 2009a).\n\n7\n\n\fTable 1: MSE on the sentiment analysis dataset of source-only baselines for each domain, K,D, B,E,\nthe uniform weighted predictor unif, KMM, and the distribution-weighted method DW based on the\nlearned z. DW outperforms all competing baselines.\n\nK\nD\nB\nE\nunif\nKMM\nDW(ours)\n\nK\n\n1.46\u00b10.08\n2.12\u00b10.08\n2.18\u00b10.11\n1.69\u00b10.09\n1.62\u00b10.05\n1.63\u00b10.15\n1.45\u00b10.08\n\nD\n\n2.20\u00b10.14\n1.78\u00b10.08\n2.01\u00b10.09\n2.31\u00b10.12\n1.84\u00b10.09\n2.07\u00b10.12\n1.78\u00b10.08\n\nB\n\n2.29\u00b10.13\n2.12\u00b10.08\n1.73\u00b10.12\n2.40\u00b10.11\n1.86\u00b10.09\n1.93\u00b10.17\n1.72\u00b10.12\n\nE\n\n1.69\u00b10.12\n2.10\u00b10.07\n2.24\u00b10.07\n1.50\u00b10.06\n1.62\u00b10.07\n1.69\u00b10.12\n1.49\u00b10.06\n\nTable 2: Digit dataset statistics.\n\nSVHN MNIST USPS\n\n# train images\n# test images\nimage size\ncolor\n\n73,257\n26,032\n32x32\nrgb\n\n60,000\n10,000\n28x28\ngray\n\n7,291\n2,007\n16x16\ngray\n\nTest Data\n\nKD\n\n1.83\u00b10.08\n1.95\u00b10.07\n2.10\u00b10.09\n2.00\u00b10.09\n1.73\u00b10.06\n1.83\u00b10.07\n1.62\u00b10.07\n\nDBE\n\nKDB\n\nBE\n\n1.99\u00b10.10\n2.11\u00b10.07\n1.99\u00b10.08\n1.95\u00b10.07\n1.74\u00b10.07\n1.82\u00b10.07\n1.61\u00b10.08\n\nKDB\n\nKBE\n\n1.81\u00b10.07\n2.11\u00b10.06\n2.05\u00b10.06\n1.86\u00b10.04\n1.70\u00b10.05\n1.75\u00b10.07\n1.56\u00b10.04\n\n2.06\u00b10.07\n1.98\u00b10.06\n2.00\u00b10.06\n2.01\u00b10.06\n1.99\u00b10.05\n1.98\u00b10.06\n2.07\u00b10.06\n2.14\u00b10.06\n1.77\u00b10.05\n1.77\u00b10.04\n1.89\u00b10.07\n1.86\u00b10.09\n1.66\u00b10.05\n1.65\u00b10.04\nTable 3: Digit dataset 
accuracy.\n\n1.78\u00b10.07\n2.00\u00b10.06\n2.14\u00b10.06\n1.84\u00b10.06\n1.69\u00b10.04\n1.78\u00b10.06\n1.58\u00b10.05\n\nKDBE\n\n1.91\u00b10.06\n2.03\u00b10.06\n2.04\u00b10.05\n1.98\u00b10.05\n1.74\u00b10.04\n1.82\u00b10.06\n1.61\u00b10.04\n\nCNN-s\nCNN-m\nCNN-u\nCNN-unif\nDW (ours)\nCNN-joint\n\nsvhn\n92.3\n15.7\n16.7\n75.7\n91.4\n90.9\n\nmnist\n66.9\n99.2\n62.3\n91.3\n98.8\n99.1\n\nusps\n65.6\n79.7\n96.6\n92.2\n95.6\n96.0\n\nTest Data\nmu\n66.7\n96.0\n68.1\n91.4\n98.3\n98.6\n\nsu\n90.4\n20.3\n22.5\n76.9\n91.7\n91.3\n\nsm\n85.2\n38.9\n29.4\n80.0\n93.5\n93.2\n\nsmu mean\n78.8\n84.2\n41.0\n55.8\n46.9\n32.9\n84.0\n80.7\n94.7\n93.6\n93.3\n94.6\n\nWe compared our method (DW) against each source hypothesis, hk. We also computed a privileged\n\nk=1 khk. -comb is of course not accessible\n\nbaseline using the oracle mixing parameter, -comb:\u2211p\n\nin practice since the target mixture is not known to the user. We also compared against a previously\nproposed domain adaptation algorithm (Huang et al., 2006) known as KMM. It is important to note\nthat the KMM model requires access to the unlabeled target data during adaptation and learns a new\npredictor for every target domain, while DW does not use any target data. Thus KMM operates in a\nfavorable learning setting when compared to our solution.\nWe \ufb01rst considered the same test scenario as in (Mansour et al., 2008), where the target is a mixture of\ntwo source domains. The plots of Figures 1a and 1b report the results of our experiments. They show\nthat our distribution-weighted predictor DW outperforms all baseline predictors despite the privileged\nlearning scenarios of -comb and KMM. We also compared our results with the weighted predictor\nused in the empirical studies by Mansour et al. (2008), which is not a realistic solution since it is using\nthe unknown target mixture as z to compute hz. 
Nevertheless, we observed that the performance of this "cheating" solution almost always coincides with that of our DW algorithm, and we thus did not include it in our plots and tables to avoid confusion.
Next, we compared the performance of DW with accessible baseline predictors on various target mixtures. Since λ is not known in practice, we replaced λ-comb with the uniform combination of all hypotheses (unif), (1/p) ∑_{k=1}^p hk. Table 1 reports the means and standard deviations of the MSE over 10 repetitions. Each column corresponds to a different target test data source. Our distribution-weighted method DW outperforms all baseline predictors across all test domains. Observe that, even when the target is a single source domain, our method successfully outperforms the predictor which is trained and tested on that same domain. Results on more target mixtures are available in Appendix E.

5.2 Recognition tasks with the cross-entropy loss

We considered two real-world domain adaptation tasks: a generalization of a digit recognition task and the standard visual adaptation Office dataset.
For each individual domain, we trained a convolutional neural network (CNN) and used the output from the softmax score layer as our base predictor hk. We computed the uniformly weighted combination of source predictors, hunif = (1/p) ∑_{k=1}^p hk. As a privileged baseline, we also trained a model on all source data combined, hjoint. Note that this approach is often not feasible if independent entities contribute classifiers and densities, but not full training datasets. Thus, this approach is not consistent with our scenario, and it operates in a much more favorable learning setting than our solution. Finally, our distribution-weighted predictor DW was computed with the hk's, the density estimates, and our learned weighting z.
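The fixed-weight combinations (unif and λ-comb) and the distribution-weighted predictor DW differ only in how the per-source weights are formed: the former use input-independent weights, while DW reweights each base prediction at every point x in proportion to zk D̂k(x). A small illustrative sketch follows; the formal solution also involves a small uniform smoothing term (and, for the cross-entropy loss, a normalization), both omitted here, and all numbers are toy values:

```python
import numpy as np

def fixed_combination(lam, base_preds):
    """unif / lambda-comb: input-independent weights lam_k over h_k(x)."""
    lam = np.asarray(lam, dtype=float)
    return lam @ np.asarray(base_preds)

def dw_predict(z, densities, base_preds):
    """Distribution-weighted combination: weights proportional to z_k * D̂k(x)."""
    w = np.asarray(z, dtype=float) * np.asarray(densities, dtype=float)
    w /= w.sum()                       # normalize the point-wise weights
    return w @ np.asarray(base_preds)  # sum_k w_k h_k(x)

# Toy example with p = 2 sources and 2 classes of softmax scores at x.
preds = np.array([[0.8, 0.2],   # h_1(x)
                  [0.4, 0.6]])  # h_2(x)

h_unif = fixed_combination([0.5, 0.5], preds)       # -> [0.6, 0.4]
h_dw   = dw_predict([0.5, 0.5], [0.9, 0.1], preds)  # -> [0.76, 0.24]
```

Here x has a much higher estimated density under the first source, so DW's output stays close to h_1(x) even with uniform z, which is what makes the combination robust to where the target mass lies.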
Our baselines thus consist of hk, hunif, hjoint, and DW.

Table 4: Office dataset accuracy. We report accuracy across the seven possible test domains. We show the performance of all baselines: CNN-a, CNN-w, CNN-d, CNN-unif, DW based on the learned z, and the jointly trained model CNN-joint. DW outperforms all competing models.

            amazon     webcam     dslr       aw         ad         wd         awd        mean
CNN-a      75.7±0.3   53.8±0.7   53.4±1.3   71.4±0.3   73.5±0.2   53.6±0.8   69.9±0.3   64.5±0.6
CNN-w      45.3±0.5   91.1±0.8   91.7±1.2   54.4±0.5   50.0±0.5   91.3±0.8   57.5±0.4   68.8±0.7
CNN-d      50.4±0.4   89.6±0.9   90.9±0.8   58.3±0.4   54.6±0.4   90.0±0.7   61.0±0.4   70.7±0.6
CNN-unif   69.7±0.3   93.1±0.6   93.2±0.9   74.4±0.4   72.1±0.3   93.1±0.5   75.9±0.3   81.6±0.5
DW (ours)  75.2±0.4   93.7±0.6   94.0±1.0   78.9±0.4   77.2±0.4   93.8±0.6   80.2±0.3   84.7±0.5
CNN-joint  72.1±0.3   93.7±0.5   93.7±0.5   76.4±0.4   76.4±0.4   93.7±0.5   79.3±0.4   83.6±0.4

We began our study with a generalization of a digit recognition task, which combines three digit recognition datasets: Google Street View House Numbers (SVHN), MNIST, and USPS. Dataset statistics for each domain can be found in Table 2. We trained the ConvNet (or CNN) architecture following Taigman et al. (2017) as our source models and joint model. We used the second fully-connected layer's output as our features for density estimation, and the output from the softmax score layer as our predictors. We used the full training set of each domain to learn the source models and densities. Note that these steps are completely isolated from one another and may be performed by distinct entities in parallel.
Finally, for our DC-programming algorithm, we used a small subset of 200 real image-label pairs from each domain to learn the parameter z.
Our next experiment used the standard visual adaptation Office dataset, which has 3 domains: amazon, webcam, and dslr. The dataset contains 31 recognition categories of objects commonly found in an office environment. There are 4,110 images in total, with 2,817 from amazon, 795 from webcam, and 498 from dslr.
We followed the standard protocol from Saenko et al. (2010), whereby 20 labeled examples are available for training from the amazon domain and 8 labeled examples are available from each of the webcam and dslr domains. The remaining examples from each domain are used for testing. We used the AlexNet ConvNet (CNN) architecture (Krizhevsky et al., 2012), pre-trained on ImageNet; we used the output from the softmax score layer as our base predictors and fc7 activations as our features for density estimation (Donahue et al., 2014).
We report the performance of our algorithm and that of the baselines on the digit recognition dataset in Table 3, and on the Office dataset in Table 4. On both datasets, we evaluated on various test distributions: each individual domain, the combination of each pair of domains, and the fully combined set. When the test distribution equals one of the source distributions, our distribution-weighted classifier successfully outperforms (webcam, dslr) or maintains the performance of the classifier which is trained and tested on the same domain.
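The mixture test sets used in Tables 3 and 4 (e.g. aw or smu) are simply unions of the per-domain test sets, so the accuracy of a single fixed predictor on a mixture is the sample-weighted average of its per-domain accuracies. A short sketch with hypothetical counts (none of these numbers come from the actual experiments):

```python
# Accuracy of one fixed predictor on a union of per-domain test sets:
# pool the correct counts and the totals, then take the ratio.
def mixture_accuracy(per_domain_correct, per_domain_total, domains):
    correct = sum(per_domain_correct[d] for d in domains)
    total = sum(per_domain_total[d] for d in domains)
    return 100.0 * correct / total

correct = {"amazon": 750, "webcam": 700, "dslr": 450}   # toy counts
total   = {"amazon": 1000, "webcam": 767, "dslr": 490}

acc_aw = mixture_accuracy(correct, total, ["amazon", "webcam"])
# 1450/1767 ≈ 82.06 (%), between the two per-domain accuracies
```

This is why a larger domain (e.g. svhn in the digit task) dominates the fully combined columns: the mixture weights are the relative test-set sizes, not uniform.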
For the more realistic scenario where the target domain is a mixture of any two or all three source domains, the performance of our method is comparable or marginally superior to that of the jointly trained network, despite the fact that we do not retrain any network parameters in our method and that we only use a small number of per-domain examples to learn the distribution weights – an optimization which may be solved on a single CPU in a matter of seconds for this problem. This again demonstrates the robustness of our distribution-weighted combined classifier to a varying target domain.

6 Conclusion

We presented practically applicable multiple-source domain adaptation algorithms for the squared loss and the cross-entropy loss. Our algorithms benefit from a series of very favorable theoretical guarantees. Our empirical results further demonstrate their effectiveness and their importance for adaptation problems in practice.

Acknowledgments

We thank Cyril Allauzen for comments on a previous draft of this paper. This work was partly funded by NSF CCF-1535987 and NSF IIS-1618662.

References

C. Arndt. Information Measures: Information and its Description in Science and Engineering. Signals and Communication Technology. Springer Verlag, 2004.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137–144, 2006.

G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In NIPS, pages 2178–2186, 2011.

J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, pages 440–447, 2007.

C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theor. Comput. Sci., 519:103–126, 2014.

C. Cortes, M. Mohri, and A. Muñoz Medina. Adaptation algorithm and theory based on generalized discrepancy. In KDD, pages 169–178, 2015.

T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.

K. Crammer, M. J. Kearns, and J. Wortman. Learning from multiple sources. Journal of Machine Learning Research, 9(Aug):1757–1774, 2008.

J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, volume 32, pages 647–655, 2014.

M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In ICML, volume 307, pages 264–271, 2008.

L. Duan, I. W. Tsang, D. Xu, and T. Chua. Domain adaptation from multiple sources via auxiliary classifiers. In ICML, volume 382, pages 289–296, 2009.

L. Duan, D. Xu, and I. W. Tsang. Domain adaptation from multiple sources: A domain-dependent regularization approach. IEEE Transactions on Neural Networks and Learning Systems, 23(3):504–518, 2012.

Y. Ganin and V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, volume 37, pages 1180–1189, 2015.

R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.

B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, pages 2066–2073, 2012.

B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICML, volume 28, pages 222–230, 2013a.

B. Gong, K. Grauman, and F. Sha. Reshaping visual datasets for domain adaptation. In NIPS, pages 1286–1294, 2013b.

J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discovering latent domains for multisource domain adaptation. In ECCV, volume 7573, pages 702–715, 2012.

J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. Efficient learning of domain-invariant image representations. In ICLR, 2013.

J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS, pages 601–608, 2006.

A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba. Undoing the damage of dataset bias. In ECCV, volume 7572, pages 158–171, 2012.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.

H. Liao. Speaker adaptation of context dependent deep neural networks. In ICASSP, pages 7947–7951, 2013.

M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, volume 37, pages 97–105, 2015.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS, pages 1041–1048, 2008.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Multiple source adaptation and the Rényi divergence. In UAI, pages 367–374, 2009a.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT, 2009b.

K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In ICML, volume 28, pages 10–18, 2013.

S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Trans. Knowl. Data Eng., 22(10):1345–1359, 2010.

Z. Pei, Z. Cao, M. Long, and J. Wang. Multi-adversarial domain adaptation. In AAAI, pages 3934–3941, 2018.

A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 547–561, 1961.

B. Roark, R. Sproat, C. Allauzen, M. Riley, J. Sorensen, and T. Tai. The OpenGrm open-source finite-state grammar software libraries. In ACL (System Demonstrations), pages 61–66, 2012.

K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, volume 6314, pages 213–226, 2010.

B. K. Sriperumbudur and G. R. G. Lanckriet. A proof of convergence of the concave-convex procedure using Zangwill's theory. Neural Computation, 24(6):1391–1407, 2012.

Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.

P. D. Tao and L. T. H. An. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.

P. D. Tao and L. T. H. An. A DC optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476–505, 1998.

A. Torralba and A. A. Efros. Unbiased look at dataset bias. In CVPR, pages 1521–1528, 2011.

E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, pages 4068–4076, 2015.

Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting low-rank structure from latent domains for domain generalization. In ECCV, volume 8691, pages 628–643, 2014.

J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM Multimedia, pages 188–197, 2007.

A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.

K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view. In AAAI, pages 3150–3157, 2015.