{"title": "Joint distribution optimal transportation for domain adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 3730, "page_last": 3739, "abstract": "This paper deals with the unsupervised domain adaptation problem, where one wants to estimate a prediction function $f$ in a given target domain without any labeled sample by exploiting the knowledge available from a source domain where labels are known. Our work makes the following assumption: there exists a non-linear transformation between the joint feature/label space distributions of the two domains $\\ps$ and $\\pt$. We propose a solution of this problem with optimal transport, which allows us to recover an estimated target distribution $\\pt^f=(X,f(X))$ by optimizing simultaneously the optimal coupling and $f$. We show that our method corresponds to the minimization of a bound on the target error, and provide an efficient algorithmic solution, for which convergence is proved. The versatility of our approach, both in terms of hypothesis classes and loss functions, is demonstrated with real-world classification and regression problems, for which we reach or surpass state-of-the-art results.", "full_text": "Joint distribution optimal transportation for domain adaptation\n\nNicolas Courty\u2217, Universit\u00e9 de Bretagne Sud, IRISA, UMR 6074, CNRS, courty@univ-ubs.fr\n\nR\u00e9mi Flamary\u2217, Universit\u00e9 C\u00f4te d\u2019Azur, Lagrange, UMR 7293, CNRS, OCA, remi.flamary@unice.fr\n\nAmaury Habrard, Univ Lyon, UJM-Saint-Etienne, CNRS, Lab. 
Hubert Curien UMR 5516, F-42023, amaury.habrard@univ-st-etienne.fr\n\nAlain Rakotomamonjy, Normandie Universit\u00e9, Universit\u00e9 de Rouen, LITIS EA 4108, alain.rakoto@insa-rouen.fr\n\nAbstract\n\nThis paper deals with the unsupervised domain adaptation problem, where one wants to estimate a prediction function f in a given target domain without any labeled sample by exploiting the knowledge available from a source domain where labels are known. Our work makes the following assumption: there exists a non-linear transformation between the joint feature/label space distributions of the two domains Ps and Pt that can be estimated with optimal transport. We propose a solution of this problem which allows us to recover an estimated target distribution P^f_t = (X, f(X)) by optimizing simultaneously the optimal coupling and f. We show that our method corresponds to the minimization of a bound on the target error, and provide an efficient algorithmic solution, for which convergence is proved. The versatility of our approach, both in terms of hypothesis classes and loss functions, is demonstrated with real-world classification and regression problems, for which we reach or surpass state-of-the-art results.\n\n1 Introduction\n\nIn the context of supervised learning, one generally assumes that the test data is a realization of the same process that generated the learning set. Yet, in many practical applications this is often not the case, since several factors can slightly alter this process. The particular case of visual adaptation [1] in computer vision is a good example: given a new dataset of images without any label, one may want to exploit a different annotated dataset, provided that they share sufficient common information and labels. However, the generating process can be different in several aspects, such as the conditions and devices used for acquisition, different pre-processing, different compressions, etc. 
Domain adaptation techniques aim at alleviating this issue by transferring knowledge between domains [2]. We propose in this paper a principled and theoretically founded way of tackling this problem.\n\nThe domain adaptation (DA) problem is not new and has received a lot of attention during the past ten years. State-of-the-art methods mainly differ by the assumptions made over the change in data distributions. In the covariate shift assumption, the differences between the domains are characterized by a change in the feature distributions P(X), while the conditional distributions P(Y|X) remain unchanged (X and Y being respectively the instance and label spaces). Importance re-weighting can be used to learn a new classifier (e.g. [3]), provided that the overlap between the distributions is large enough. Kernel alignment [4] has also been considered for the same purpose. Other types of methods, denoted as Invariant Components by Gong and co-authors [5], look for a transformation T such that the new representations of input data are matching, i.e. Ps(T(X)) = Pt(T(X)). Methods then differ by: i) the considered class of transformations, generally defined as projections (e.g. [6, 7, 8, 9, 5]), affine transforms [4] or non-linear transformations as expressed by neural networks [10, 11]; ii) the types of divergences used to compare Ps(T(X)) and Pt(T(X)), such as Kullback-Leibler [12] or Maximum Mean Discrepancy [9, 5]. Those divergences usually require that the distributions share a common support to be defined. A particular case is found in the use of optimal transport, introduced for domain adaptation by [13, 14]. T is then defined to be a push-forward operator such that Ps(X) = Pt(T(X)) and that minimizes a global transportation effort or cost between distributions. \n\n\u2217Both authors contributed equally.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n
The associated divergence is the so-called Wasserstein metric, which has a natural Lagrangian formulation and avoids the estimation of continuous distributions by means of kernels. As such, it also alleviates the need for a shared support.\n\nThe methods discussed above implicitly assume that the conditional distributions are unchanged by T, i.e. Ps(Y|T(X)) \u2248 Pt(Y|T(X)), but there is no clear reason for this assumption to hold. A more general approach is to adapt both marginal feature and conditional distributions by minimizing a global divergence between them. However, this task is usually hard since no label is available in the target domain and therefore no empirical version of Pt(Y|X) can be used. This was previously achieved by restricting to specific classes of transformations such as projections [9, 5].\n\nContributions and outline. In this work we propose a novel framework for unsupervised domain adaptation between joint distributions. We propose to find a function f that predicts an output value given an input x \u2208 X, and that minimizes the optimal transport loss between the joint source distribution Ps and an estimated target joint distribution P^f_t = (X, f(X)) depending on f (detailed in Section 2). The method is denoted as JDOT for "Joint Distribution Optimal Transport" in the remainder. We show that the resulting optimization problem stands for a minimization of a bound on the target error of f (Section 3) and propose an efficient algorithm to solve it (Section 4). Our approach is very general and does not require learning an explicit transformation, as it directly solves for the best function. We show that it can handle both regression and classification problems with a large class of functions f including kernel machines and neural networks. 
We finally provide several numerical experiments on real regression and classification problems that show the performance of JDOT over the state-of-the-art (Section 5).\n\n2 Joint distribution Optimal Transport\n\nLet \u2126 \u2208 Rd be a compact input measurable space of dimension d and C the set of labels. P(\u2126) denotes the set of all the probability measures over \u2126. The standard learning paradigm classically assumes the existence of a set of data Xs = {x_i^s}_{i=1}^{Ns} associated with a set of class labels Ys = {y_i^s}_{i=1}^{Ns}, y_i^s \u2208 C (the learning set), and a data set with unknown labels Xt = {x_i^t}_{i=1}^{Nt} (the testing set). In order to determine the set of labels Yt associated with Xt, one usually relies on an empirical estimate of the joint probability distribution P(X, Y) \u2208 P(\u2126 \u00d7 C) from (Xs, Ys), and the assumption that Xs and Xt are drawn from the same distribution \u00b5 \u2208 P(\u2126). In the considered adaptation problem, one assumes the existence of two distinct joint probability distributions Ps(X, Y) and Pt(X, Y) which correspond respectively to two different source and target domains. We will write \u00b5s and \u00b5t their respective marginal distributions over X.\n\n2.1 Optimal transport in domain adaptation\n\nThe Monge problem seeks a map T0 : \u2126 \u2192 \u2126 that pushes \u00b5s toward \u00b5t, defined as:\n\nT0 = argmin_T \u222b_\u2126 d(x, T(x)) d\u00b5s(x), s.t. T#\u00b5s = \u00b5t,\n\nwhere T#\u00b5s is the image measure of \u00b5s by T, verifying:\n\nT#\u00b5s(A) = \u00b5s(T^{-1}(A)), \u2200 Borel subset A \u2282 \u2126, (1)\n\nand d : \u2126 \u00d7 \u2126 \u2192 R+ is a metric. In the remainder, we will always consider, without further notification, the case where d is the squared Euclidean metric. When T0 exists, it is called an optimal transport map, but this is not always the case (e.g. 
assume that \u00b5s is defined by one Dirac measure and \u00b5t by two). A relaxed version of this problem has been proposed by Kantorovich [15], who rather seeks a transport plan (or equivalently a joint probability distribution) \u03b3 \u2208 P(\u2126 \u00d7 \u2126) such that:\n\n\u03b30 = argmin_{\u03b3 \u2208 \u03a0(\u00b5s, \u00b5t)} \u222b_{\u2126\u00d7\u2126} d(x1, x2) d\u03b3(x1, x2), (2)\n\nwhere \u03a0(\u00b5s, \u00b5t) = {\u03b3 \u2208 P(\u2126 \u00d7 \u2126) | p+#\u03b3 = \u00b5s, p\u2212#\u03b3 = \u00b5t} and p+ and p\u2212 denote the two marginal projections of \u2126 \u00d7 \u2126 to \u2126. Minimizers of this problem are called optimal transport plans. Should \u03b30 be of the form (id \u00d7 T)#\u00b5s, then the solutions to the Kantorovich and Monge problems coincide. As such, the Kantorovich relaxation can be seen as a generalization of the Monge problem, with fewer constraints on the existence and uniqueness of solutions [16].\n\nOptimal transport has been used in DA as a principled way to bring the source and target distributions closer [13, 14, 17], by seeking a transport plan between the empirical distributions of Xs and Xt and interpolating Xs thanks to a barycentric mapping [14], or by estimating a mapping which is not the solution of the Monge problem but allows mapping unseen samples [17]. Moreover, these works show that better constraining the structure of \u03b3 through entropic or classwise regularization terms helps in achieving better empirical results.\n\n2.2 Joint distribution optimal transport loss\n\nThe main idea of this work is to handle a change in both marginal and conditional distributions. As such, we are looking for a transformation T that will align directly the joint distributions Ps and Pt. 
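To make the discrete Kantorovich problem (2) concrete, here is a minimal sketch (ours, not part of the paper's implementation, which relies on dedicated OT solvers) that solves it between two small empirical point clouds with uniform weights as a linear program; the helper `kantorovich_plan` and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich_plan(cost):
    """Solve the discrete Kantorovich problem for uniform marginals.

    cost: (ns, nt) ground-cost matrix d(x1, x2); returns the optimal plan gamma.
    """
    ns, nt = cost.shape
    mu_s = np.full(ns, 1.0 / ns)   # uniform source marginal
    mu_t = np.full(nt, 1.0 / nt)   # uniform target marginal
    # Equality constraints: row sums of gamma = mu_s, column sums = mu_t.
    A_eq = np.zeros((ns + nt, ns * nt))
    for i in range(ns):
        A_eq[i, i * nt:(i + 1) * nt] = 1.0   # sum_j gamma_ij = mu_s[i]
    for j in range(nt):
        A_eq[ns + j, j::nt] = 1.0            # sum_i gamma_ij = mu_t[j]
    b_eq = np.concatenate([mu_s, mu_t])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.x.reshape(ns, nt)

# Squared Euclidean ground cost between two small 1D point clouds.
xs = np.array([0.0, 1.0, 2.0])
xt = np.array([0.1, 1.1, 2.1])
C = (xs[:, None] - xt[None, :]) ** 2
gamma = kantorovich_plan(C)   # concentrates its mass on the diagonal here
```

For larger problems one would use a network simplex or entropy-regularized solver instead of a dense LP, but the feasible set is exactly the transportation polytope described in the text.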
Following the Kantorovich formulation of (2), T will be implicitly expressed through a coupling between both joint distributions as:\n\n\u03b30 = argmin_{\u03b3 \u2208 \u03a0(Ps, Pt)} \u222b_{(\u2126\u00d7C)^2} D(x1, y1; x2, y2) d\u03b3(x1, y1; x2, y2), (3)\n\nwhere D(x1, y1; x2, y2) = \u03b1 d(x1, x2) + L(y1, y2) is a joint cost measure combining both the distance between the samples and a loss function L measuring the discrepancy between y1 and y2. While this joint cost is specific (separable), we leave the analysis of generic joint cost functions for future work. Putting it in words, matching close source and target samples with similar labels costs little. \u03b1 is a positive parameter which balances the metric in the feature space and the loss. As such, when \u03b1 \u2192 +\u221e, this cost is dominated by the metric in the input feature space, and the solution of the coupling problem is the same as in [14]. It can be shown that a minimizer to (3) always exists and is unique provided that D(\u00b7) is lower semi-continuous (see [18], Theorem 4.1), which is the case when d(\u00b7) is a norm and for all usual loss functions [19].\n\nIn the unsupervised DA problem, one does not have access to labels in the target domain, and as such it is not possible to find the optimal coupling. Since our goal is to find a function on the target domain f : \u2126 \u2192 C, we suggest replacing y2 by a proxy f(x2). This leads to the definition of the following joint distribution that uses a given function f as a proxy for y:\n\nP^f_t = (x, f(x))_{x \u223c \u00b5t}. (4)\n\nIn practice we consider empirical versions of Ps and P^f_t, i.e. \u02c6Ps = (1/Ns) \u2211_{i=1}^{Ns} \u03b4_{x_i^s, y_i^s} and \u02c6P^f_t = (1/Nt) \u2211_{i=1}^{Nt} \u03b4_{x_i^t, f(x_i^t)}. \u03b3 is then a matrix which belongs to \u2206, i.e. the transportation polytope of non-negative matrices between uniform distributions. Since our goal is to estimate a prediction f on the target domain, we propose to find the one that produces predictions that match optimally source labels to the aligned target instances in the transport plan. For this purpose, we propose to solve the following problem for JDOT:\n\nmin_{f, \u03b3 \u2208 \u2206} \u2211_{ij} D(x_i^s, y_i^s; x_j^t, f(x_j^t)) \u03b3_{ij} \u2261 min_f W1(\u02c6Ps, \u02c6P^f_t), (5)\n\nwhere W1 is the 1-Wasserstein distance for the loss D(x1, y1; x2, y2) = \u03b1 d(x1, x2) + L(y1, y2). We will make clear in the next section that the function f we retrieve is theoretically sound with respect to the target error. Note that in practice we add a regularization term for the function f in order to avoid overfitting, as discussed in Section 4. An illustration of JDOT for a regression problem is given in Figure 1. In this figure, we have very different joint and marginal distributions, but we want to illustrate that the OT matrix \u03b3 obtained using the true empirical distribution Pt is very similar to the one obtained with the proxy P^f_t, which leads to a very good model for JDOT.\n\nFigure 1: Illustration of JDOT on a 1D regression problem. (left) Source and target empirical distributions and marginals (middle left) Source and target models (middle right) OT matrix on empirical joint distributions and with JDOT proxy joint distribution (right) estimated prediction function f.\n\nChoice of \u03b1. This is an important parameter balancing the alignment of feature space and labels. A natural choice of the \u03b1 parameter is obtained by normalizing the range of values of d(x_i^s, x_j^t) with \u03b1 = 1/max_{i,j} d(x_i^s, x_j^t). In the numerical experiment section, we show that this setting is very good in two out of three experiments. 
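The separable joint cost of Eq. (3) and the heuristic choice of \u03b1 above can be sketched as follows; the helper names, the squared loss (regression case) and the toy data are our illustrative assumptions, not the paper's code.

```python
import numpy as np

def jdot_cost(Xs, ys, Xt, f, alpha=None):
    """Joint cost D_ij = alpha * d(xs_i, xt_j) + L(ys_i, f(xt_j)),
    with squared Euclidean d and a squared loss L (regression case)."""
    d = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)   # pairwise d(xs_i, xt_j)
    if alpha is None:
        alpha = 1.0 / d.max()      # heuristic normalization discussed in the text
    L = (ys[:, None] - f(Xt)[None, :]) ** 2                # label-loss proxy via f
    return alpha * d + L

def jdot_objective(D, gamma):
    """Value of the transport objective: sum_ij D_ij * gamma_ij."""
    return float((D * gamma).sum())

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(4, 2)), rng.normal(size=(5, 2)) + 1.0
ys = Xs.sum(axis=1)
f = lambda X: X.sum(axis=1)        # hypothetical current predictor
D = jdot_cost(Xs, ys, Xt, f)
gamma = np.full((4, 5), 1.0 / 20)  # uniform (non-optimal) coupling for illustration
val = jdot_objective(D, gamma)
```

Minimizing this value jointly over the coupling and over f is exactly the JDOT problem of Eq. (5).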
However, in some cases, better performance is obtained with a cross-validation of this parameter. Also note that \u03b1 is strongly linked to the smoothness of the loss L and of the optimal labelling functions, and can be seen as a Lipschitz constant in the bound of Theorem 3.1.\n\nRelation to other optimal transport based DA methods. Previous DA methods based on optimal transport [14, 17] not only differ by the nature of the considered distributions, but also in the way the optimal plan is used to find f. They learn a complex mapping between the source and target distributions when the objective is only to estimate a prediction function f on the target. To do so, they rely on a barycentric mapping that minimizes only approximately the Wasserstein distance between the distributions. As discussed in Section 4, JDOT uses the optimal plan to propagate and fuse the labels from the source to the target. Not only are the performances enhanced, but we also show in Section 3 how this approach is more theoretically grounded.\n\nRelation to Transport Lp distances. Recently, Thorpe and co-authors introduced the Transportation Lp distance [20]. Their objective is to compute a meaningful distance between multi-dimensional signals. Interestingly, their distance can be seen as optimal transport between two distributions of the form (4) where the functions are known and the label loss L is chosen as an Lp distance. While their approach is inspirational, JDOT is different both in its formulation, where we introduce a more general class of losses L, and in its objective, as our goal is to estimate the target function f which is not known a priori. Finally, we show theoretically and empirically that our formulation successfully addresses the problem of domain adaptation.\n\n3 A Bound on the Target Error\n\nLet f be a hypothesis function from a given class of hypotheses H. We define the expected loss in the target domain errT(f) as errT(f) := E_{(x,y)\u223cPt} L(y, f(x)). 
We define similarly errS(f) for the source domain. We assume the loss function L to be bounded, symmetric, k-Lipschitz and satisfying the triangle inequality.\n\nTo provide some guarantees on our method, we consider an adaptation of the notion of probabilistic Lipschitzness introduced in [21, 22], which assumes that two close instances must have the same labels with high probability. It corresponds to a relaxation of classic Lipschitzness allowing one to model the marginal-label relatedness such as in Nearest-Neighbor classification, linear classification or the cluster assumption. We propose an extension of this notion in a domain adaptation context by assuming that a labeling function must comply with two close instances of each domain w.r.t. a coupling \u03a0.\n\nDefinition (Probabilistic Transfer Lipschitzness) Let \u00b5s and \u00b5t be respectively the source and target distributions. Let \u03c6 : R \u2192 [0, 1]. 
A labeling function f : \u2126 \u2192 R and a joint distribution \u03a0(\u00b5s, \u00b5t) over \u00b5s and \u00b5t are \u03c6-Lipschitz transferable if for all \u03bb > 0:\n\nPr_{(x1,x2) \u223c \u03a0(\u00b5s,\u00b5t)} [|f(x1) \u2212 f(x2)| > \u03bb d(x1, x2)] \u2264 \u03c6(\u03bb).\n\nIntuitively, given a deterministic labeling function f and a coupling \u03a0, it bounds the probability of finding pairs of source-target instances labelled differently in a (1/\u03bb)-ball with respect to \u03a0. We can now give our main result (simplified version):\n\nTheorem 3.1 Let f \u2208 H be any labeling function. Let \u03a0\u2217 = argmin_{\u03a0 \u2208 \u03a0(Ps, P^f_t)} \u222b_{(\u2126\u00d7C)^2} \u03b1 d(xs, xt) + L(ys, yt) d\u03a0(xs, ys; xt, yt) and W1(\u02c6Ps, \u02c6P^f_t) the associated 1-Wasserstein distance. Let f\u2217 \u2208 H be a Lipschitz labeling function that verifies the \u03c6-probabilistic transfer Lipschitzness (PTL) assumption w.r.t. \u03a0\u2217 and that minimizes the joint error errS(f\u2217) + errT(f\u2217) w.r.t. all PTL functions compatible with \u03a0\u2217. We assume the input instances are bounded s.t. |f\u2217(x1) \u2212 f\u2217(x2)| \u2264 M for all x1, x2. Let L be any symmetric loss function, k-Lipschitz and satisfying the triangle inequality. Consider a sample of Ns labeled source instances drawn from Ps and Nt unlabeled instances drawn from \u00b5t; then for all \u03bb > 0, with \u03b1 = k\u03bb, we have with probability at least 1 \u2212 \u03b4 that:\n\nerrT(f) \u2264 W1(\u02c6Ps, \u02c6P^f_t) + \u221a((2/c\u2032) log(2/\u03b4)) (1/\u221aNS + 1/\u221aNT) + errS(f\u2217) + errT(f\u2217) + kM\u03c6(\u03bb).\n\nThe detailed proof of Theorem 3.1 is given in the supplementary material. The bound on the target error above is interesting to interpret. 
The first two terms correspond to the objective function (5) we propose to minimize, accompanied with a sampling bound. The last term \u03c6(\u03bb) assesses the probability under which the probabilistic Lipschitzness does not hold. The remaining two terms involving f\u2217 correspond to the joint error minimizer, illustrating that domain adaptation can work only if we can predict well in both domains, similarly to existing results in the literature [23, 24]. If the last terms are small enough, adaptation is possible if we are able to align well Ps and P^f_t, provided that f\u2217 and \u03a0\u2217 verify the PTL. Finally, note that \u03b1 = k\u03bb, and tuning this parameter is thus actually related to finding the Lipschitz constants of the problem.\n\n4 Learning with Joint Distribution OT\n\nIn this section, we provide some details about JDOT's optimization problem given in Equation (5) and discuss algorithms for its resolution. We will assume that the function space H to which f belongs is either a RKHS or a function space parametrized by some parameters w \u2208 Rp. This framework encompasses linear models, neural networks, and kernel methods. Accordingly, we are going to define a regularization term \u2126(f) on f. Depending on how H is defined, \u2126(f) is either a non-decreasing function of the squared norm induced by the RKHS (so that the representer theorem is applicable) or a squared norm on the vector parameter. We will further assume that \u2126(f) is continuously differentiable. As discussed above, f is to be learned according to the following optimization problem:\n\nmin_{f \u2208 H, \u03b3 \u2208 \u2206} \u2211_{i,j} \u03b3_{i,j} (\u03b1 d(x_i^s, x_j^t) + L(y_i^s, f(x_j^t))) + \u03bb\u2126(f), (6)\n\nwhere the loss function L is continuous and differentiable with respect to its second variable. 
Note that while the above problem does not involve any regularization term on the coupling matrix \u03b3, this is essentially for the sake of simplicity and readability. Regularizers such as entropic regularization [25], which is relevant when the number of samples is very large, can still be used without significant change to the algorithmic framework.\n\nOptimization procedure. According to the above hypotheses on f and L, Problem (6) is smooth and the constraints are separable according to f and \u03b3. Hence, a natural way to solve Problem (6) is to rely on alternate optimization w.r.t. both parameters \u03b3 and f. This algorithm is well known as Block Coordinate Descent (BCD) or the Gauss-Seidel method (the pseudo-code of the algorithm is given in the appendix). The block optimization steps are discussed in further detail in the following.\n\nSolving with fixed f boils down to a classical OT problem with a loss matrix C such that C_{i,j} = \u03b1 d(x_i^s, x_j^t) + L(y_i^s, f(x_j^t)). We can use classical OT solvers such as the network simplex algorithm, but other strategies can be considered, such as regularized OT [25] or stochastic versions [26]. The optimization problem with fixed \u03b3 leads to a new learning problem expressed as:\n\nmin_{f \u2208 H} \u2211_{i,j} \u03b3_{i,j} L(y_i^s, f(x_j^t)) + \u03bb\u2126(f). (7)\n\nNote how the data fitting term elegantly and naturally encodes the transfer of source labels y_i^s through estimated labels of test samples, with a weighting depending on the optimal transport matrix. However, this comes at the price of having a quadratic number NsNt of terms, which can be considered computationally expensive. We will see in the sequel that we can benefit from the structure of the chosen loss to greatly reduce this complexity. 
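The two BCD blocks above can be sketched as follows for the squared loss, where the fixed-\u03b3 step reduces to ridge regression on the propagated labels of Eq. (8). This is a hedged sketch under our own assumptions: we substitute entropy-regularized Sinkhorn iterations for the exact OT step (the text notes regularized OT [25] is admissible) and take f linear; all names are illustrative.

```python
import numpy as np

def sinkhorn(C, reg=0.1, n_iter=200):
    """Entropy-regularized OT plan between uniform marginals (stand-in for
    the exact network-simplex step)."""
    ns, nt = C.shape
    a, b = np.full(ns, 1.0 / ns), np.full(nt, 1.0 / nt)
    K = np.exp(-C / (reg * C.max()))   # rescale cost for numerical stability
    v = np.ones(nt)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def jdot_bcd_lsq(Xs, ys, Xt, alpha, lam=1e-2, n_outer=10):
    """Block coordinate descent for JDOT with squared loss and linear f(x)=x@w.

    Alternates (1) the OT step with f fixed and (2) ridge regression on the
    propagated labels yhat_j = nt * sum_i gamma_ij * ys_i (cf. Eq. (8))."""
    nt, p = Xt.shape
    d = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)  # squared Euclidean
    w = np.zeros(p)
    for _ in range(n_outer):
        L = (ys[:, None] - (Xt @ w)[None, :]) ** 2        # label loss of current f
        gamma = sinkhorn(alpha * d + L)                   # block 1: OT plan
        yhat = nt * gamma.T @ ys                          # label propagation
        w = np.linalg.solve(Xt.T @ Xt / nt + lam * np.eye(p),
                            Xt.T @ yhat / nt)             # block 2: ridge solve
    return w, gamma
```

The released implementation (see the JDOT repository referenced below) follows the same alternating structure while supporting kernel machines and neural networks for f.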
In addition, we emphasize that when H is a RKHS, owing to the kernel trick and the representer theorem, problem (7) can be re-expressed as an optimization problem with Nt parameters, all belonging to R.\n\nLet us now discuss briefly the convergence of the proposed algorithm. Owing to the 2-block coordinate descent structure, to the differentiability of the objective function in Problem (6), and to the fact that the constraints on f (or its kernel trick parameters) and \u03b3 are closed, non-empty and convex, the convergence result of Grippo et al. [27] on 2-block Gauss-Seidel methods directly applies. It states that if the sequence {\u03b3^k, f^k} produced by the algorithm has limit points, then every limit point of the sequence is a critical point of Problem (6).\n\nEstimating f for least-squares regression problems. We detail the use of JDOT for the transfer least-squares regression problem, i.e. when L is the squared loss. In this context, when the optimal transport matrix \u03b3 is fixed, the learning problem boils down to:\n\nmin_{f \u2208 H} (1/nt) \u2211_j \u2016\u02c6y_j \u2212 f(x_j^t)\u2016^2 + \u03bb\u2016f\u2016^2, (8)\n\nwhere \u02c6y_j = nt \u2211_i \u03b3_{i,j} y_i^s is a weighted average of the source target values. Note that this simplification results from the properties of the quadratic loss and that it may not occur for more complex regression losses.\n\nEstimating f for hinge loss classification problems. We now aim at estimating a multiclass classifier with a one-against-all strategy. We suppose that the data fitting term is the binary squared hinge loss of the form L(y, f(x)) = max(0, 1 \u2212 yf(x))^2. In a one-against-all strategy we often use the binary matrices P such that P^s_{i,k} = 1 if sample i is of class k, else P^s_{i,k} = 0. Denote as f_k \u2208 H the decision function related to the k-vs-all problem. 
The learning problem (7) can now be expressed as:\n\nmin_{f_k \u2208 H} \u2211_{j,k} \u02c6P_{j,k} L(1, f_k(x_j^t)) + (1 \u2212 \u02c6P_{j,k}) L(\u22121, f_k(x_j^t)) + \u03bb \u2211_k \u2016f_k\u2016^2, (9)\n\nwhere \u02c6P is the transported class proportion matrix \u02c6P = (1/Nt) \u03b3^T P^s. Interestingly, this formulation illustrates that for each target sample, the data fitting term is a convex sum of hinge losses for a negative and a positive label, with weights in \u03b3.\n\n5 Numerical experiments\n\nIn this section we evaluate the performance of our method (JDOT) on two different transfer tasks of classification and regression on real datasets.2\n\nCaltech-Office classification dataset. This dataset [28] is dedicated to visual adaptation. It contains images from four different domains: Amazon, the Caltech-256 image collection, Webcam and DSLR. Several factors (such as presence/absence of background, lightning conditions, image quality, etc.) induce a distribution shift between the domains, and it is therefore relevant to consider a domain adaptation task to perform the classification. Following [14], we choose deep learning features to represent the images, extracted as the weights of the fully connected 6th layer of the DECAF convolutional neural network [29], pre-trained on ImageNet. The final feature vector is a sparse 4096-dimensional vector.\n\n2Open Source Python implementation of JDOT: https://github.com/rflamary/JDOT\n\nTable 1: Accuracy on the Caltech-Office Dataset. 
Best value in bold.\n\nDomains | Base | SurK | SA | ARTL | OT-IT | OT-MM | JDOT\ncaltech\u2192amazon | 92.07 | 91.65 | 90.50 | 92.17 | 89.98 | 92.59 | 91.54\ncaltech\u2192webcam | 76.27 | 77.97 | 81.02 | 80.00 | 80.34 | 78.98 | 88.81\ncaltech\u2192dslr | 84.08 | 82.80 | 85.99 | 88.54 | 78.34 | 76.43 | 89.81\namazon\u2192caltech | 84.77 | 84.95 | 85.13 | 85.04 | 85.93 | 87.36 | 85.22\namazon\u2192webcam | 79.32 | 81.36 | 85.42 | 79.32 | 74.24 | 85.08 | 84.75\namazon\u2192dslr | 86.62 | 87.26 | 89.17 | 85.99 | 77.71 | 79.62 | 87.90\nwebcam\u2192caltech | 71.77 | 71.86 | 75.78 | 72.75 | 84.06 | 82.99 | 82.64\nwebcam\u2192amazon | 79.44 | 78.18 | 81.42 | 79.85 | 89.56 | 90.50 | 90.71\nwebcam\u2192dslr | 96.18 | 95.54 | 94.90 | 100.00 | 99.36 | 99.36 | 98.09\ndslr\u2192caltech | 77.03 | 76.94 | 81.75 | 78.45 | 85.57 | 83.35 | 84.33\ndslr\u2192amazon | 83.19 | 82.15 | 83.19 | 83.82 | 90.50 | 90.50 | 88.10\ndslr\u2192webcam | 96.27 | 92.88 | 88.47 | 98.98 | 96.61 | 96.61 | 96.61\nMean | 83.92 | 83.63 | 85.23 | 85.41 | 86.02 | 86.95 | 89.04\nMean rank | 5.33 | 5.58 | 4.00 | 3.75 | 3.50 | 2.83 | 2.50\np-value | <0.01 | <0.01 | 0.01 | 0.04 | 0.25 | 0.86 | \u2212\n\nTable 2: Accuracy on the Amazon review experiment. 
Maximum value in bold font.\n\nDomains | NN | DANN | JDOT (mse) | JDOT (Hinge)\nbooks\u2192dvd | 0.805 | 0.806 | 0.794 | 0.795\nbooks\u2192kitchen | 0.768 | 0.767 | 0.791 | 0.794\nbooks\u2192electronics | 0.746 | 0.747 | 0.778 | 0.781\ndvd\u2192books | 0.725 | 0.747 | 0.761 | 0.763\ndvd\u2192kitchen | 0.760 | 0.765 | 0.811 | 0.821\ndvd\u2192electronics | 0.732 | 0.738 | 0.778 | 0.788\nkitchen\u2192books | 0.704 | 0.718 | 0.732 | 0.728\nkitchen\u2192dvd | 0.723 | 0.730 | 0.764 | 0.765\nkitchen\u2192electronics | 0.847 | 0.846 | 0.844 | 0.845\nelectronics\u2192books | 0.713 | 0.718 | 0.740 | 0.749\nelectronics\u2192dvd | 0.726 | 0.726 | 0.738 | 0.737\nelectronics\u2192kitchen | 0.855 | 0.850 | 0.868 | 0.872\nMean | 0.759 | 0.763 | 0.783 | 0.787\np-value | 0.004 | 0.006 | 0.025 | \u2212\n\nWe compare our method with four other methods: the surrogate kernel approach ([4], denoted SurK), subspace adaptation for its simplicity and good performance on visual adaptation ([8], SA), Adaptation Regularization based Transfer Learning ([30], ARTL), and the two variants of regularized optimal transport [14]: entropy-regularized OT-IT and classwise regularization implemented with the Majoration-Minimization algorithm OT-MM, which was shown to give better results in practice than its group-lasso counterpart. The classification is conducted with an SVM together with a linear kernel for every method. Its results when learned on the source domain and tested on the target domain are also reported to serve as a baseline (Base). All the methods have hyper-parameters, which are selected using the reverse cross-validation of Zhong and colleagues [31]. The dimension d for SA is chosen from {1, 4, 7, . . . , 31}. The entropy regularization for OT-IT and OT-MM is taken from {10^2, . . . , 10^5}, 10^2 being the minimum value for the Sinkhorn algorithm to prevent numerical errors. Finally the \u03b7 parameter of OT-MM is selected from {1, . . . , 10^5} and the \u03b1 in JDOT from {10^{-5}, 10^{-4}, . . . 
, 1}.\n\nThe classification accuracy for all the methods is reported in Table 1. We can see that JDOT consistently outperforms the baseline (5 points on average), indicating that the adaptation is successful in every case. Its mean accuracy is the best, as well as its average ranking. We conducted a Wilcoxon signed-rank test to test whether JDOT was statistically better than the other methods, and report the p-values in the tables. This test shows that JDOT is statistically better than the considered methods, except for the OT based ones that were state of the art on this dataset [14].\n\nAmazon review classification dataset. We now consider the Amazon review dataset [32], which contains online reviews of different products collected on the Amazon website. Reviews are encoded with bag-of-words unigram and bigram features as input. The problem is to predict positive (higher than 3 stars) or negative (3 stars or less) notation of reviews (binary classification). Since different\n\nTable 3: Comparison of different methods on the Wifi localization dataset. 
Maximum value in bold.\n\nDomains | KRR | SurK | DIP | DIP-CC | GeTarS | CTC | CTC-TIP | JDOT\nt1\u2192t2 | 80.84\u00b11.14 | 90.36\u00b11.22 | 87.98\u00b12.33 | 91.30\u00b13.24 | 86.76\u00b11.91 | 89.36\u00b11.78 | 89.22\u00b11.66 | 93.03\u00b11.24\nt1\u2192t3 | 76.44\u00b12.66 | 94.97\u00b11.29 | 84.20\u00b14.29 | 84.32\u00b14.57 | 90.62\u00b12.25 | 94.80\u00b10.87 | 92.60\u00b14.50 | 90.06\u00b12.01\nt2\u2192t3 | 67.12\u00b11.28 | 85.83\u00b11.31 | 80.58\u00b12.10 | 81.22\u00b14.31 | 82.68\u00b13.71 | 87.92\u00b11.87 | 89.52\u00b11.14 | 86.76\u00b11.72\nhallway1 | 60.02\u00b12.60 | 76.36\u00b12.44 | 77.48\u00b12.68 | 76.24\u00b15.14 | 84.38\u00b11.98 | 86.98\u00b12.02 | 86.78\u00b12.31 | 98.83\u00b10.58\nhallway2 | 49.38\u00b12.30 | 64.69\u00b10.77 | 78.54\u00b11.66 | 77.8\u00b12.70 | 77.38\u00b12.09 | 87.74\u00b11.89 | 87.94\u00b12.07 | 98.45\u00b10.67\nhallway3 | 48.42\u00b11.32 | 65.73\u00b11.57 | 75.10\u00b13.39 | 73.40\u00b14.06 | 80.64\u00b11.76 | 82.02\u00b12.34 | 81.72\u00b12.25 | 99.27\u00b10.41\n\nwords are employed to qualify the different categories of products, a domain adaptation task can be formulated if one wants to predict positive reviews of a product from labelled reviews of a different product. Following [33, 11], we consider only a subset of four different types of products: books, DVDs, electronics and kitchens. This yields 12 possible adaptation tasks. Each domain contains 2000 labelled samples and approximately 4000 unlabelled ones. We therefore use these unlabelled samples to perform the transfer, and test on the 2000 labelled data.\n\nThe goal of this experiment is to compare to the state-of-the-art method on this subset, namely the Domain Adversarial Neural Network ([11], denoted DANN), and to show the versatility of our method, which can adapt to any type of classifier. 
The neural network used for all methods in this experiment is a simple 2-layer model with a sigmoid activation function in the hidden layer to promote non-linearity; 50 neurons are used in this hidden layer. For DANN, hyper-parameters are set through the reverse cross-validation proposed in [11], and following the recommendation of the authors the learning rate is set to 10^-3. In the case of JDOT, we used the heuristic setting α = 1/max_{i,j} d(x^s_i, x^t_j), and as such we do not need any cross-validation. The squared Euclidean norm is used as the metric in feature space, and we test both the mean squared error (mse) and the Hinge loss as loss functions. 10 iterations of the block coordinate descent are performed. For each method, we stop the learning process of the network after 5 epochs. Classification accuracies are presented in Table 2. The neural network (NN), trained on source and tested on target, is also reported as a baseline. JDOT surpasses DANN in 11 out of 12 tasks (all except books→dvd). The Hinge loss is better than mse in 10 out of 12 cases, which is expected given the superiority of the Hinge loss on classification tasks [19].
Wifi localization regression dataset. For the regression task, we use the cross-domain indoor Wifi localization dataset proposed by Zhang and co-authors [4], and recently studied in [5]. From a multi-dimensional signal (a collection of signal strengths perceived from several access points), the goal is to locate the device in a hallway, discretized into a grid of 119 squares, by learning a mapping from the signal to the grid element. This translates into a regression problem. As the signals were acquired at different time periods by different devices, a shift can be encountered, which calls for an adaptation. In the remainder, we follow exactly the same experimental protocol as in [4, 5] for ease of comparison.
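The JDOT procedure used in these experiments (heuristic α, then block coordinate descent alternating between the coupling and the estimation of f) can be sketched on a toy problem. Everything below is a simplifying assumption: 1-D synthetic shifted-regression data, a closed-form linear ridge model standing in for kernel ridge regression, and the fact that, with uniform weights and equal sample sizes, the optimal coupling is a permutation, so a linear assignment solver replaces a general OT solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Toy 1-D regression problem with a covariate shift: target features are the
# source features translated by 1, with the same conditional law (illustrative).
n = 100
xs = rng.normal(0.0, 1.0, n)
ys = 2.0 * xs                  # source labels
xt = xs + 1.0                  # shifted target features
yt = 2.0 * (xt - 1.0)          # held-out target labels (evaluation only)

def fit_ridge(x, y, lam=1e-6):
    """Closed-form ridge regression with features (1, x); a stand-in for f."""
    phi = np.column_stack([np.ones_like(x), x])
    w = np.linalg.solve(phi.T @ phi + lam * np.eye(2), phi.T @ y)
    return lambda z: w[0] + w[1] * z

d = (xs[:, None] - xt[None, :]) ** 2   # squared Euclidean d(x^s_i, x^t_j)
alpha = 1.0 / d.max()                  # heuristic alpha from the paper

f = fit_ridge(xs, ys)                  # initialize f on the source domain
for _ in range(10):                    # 10 BCD iterations, as in the experiments
    # (1) coupling step: minimize <gamma, alpha*d + L> over couplings; with
    # uniform weights and equal sample sizes the optimum is a permutation,
    # so a linear assignment solver suffices.
    loss = (ys[:, None] - f(xt)[None, :]) ** 2
    row, col = linear_sum_assignment(alpha * d + loss)
    # (2) prediction step: refit f on target points with transported labels.
    y_prox = np.empty(n)
    y_prox[col] = ys[row]
    f = fit_ridge(xt, y_prox)

print("target MSE:", np.mean((f(xt) - yt) ** 2))
```

On this toy shift the coupling recovers the translation and f converges to the correct target regressor; the paper's actual experiments use kernel ridge regression or neural networks for f and an exact OT solver for the coupling.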
Two cases of adaptation are considered: transfer across periods, for which three time periods t1, t2 and t3 are considered, and transfer across devices, where three different devices are used to collect the signals in the same straight-line hallways (hallway1-3), leading to three different adaptation tasks in both cases.
We compare the results of our method with several state-of-the-art methods: kernel ridge regression with RBF kernel (KRR), surrogate kernel ([4], denoted SurK), domain-invariant projection and its cluster regularized version ([7], denoted respectively DIP and DIP-CC), generalized target shift ([34], denoted GeTarS), and conditional transferable components with its target information preservation regularization ([5], denoted respectively CTC and CTC-TIP). As in [4, 5], the hyper-parameters of the competing methods are cross-validated on a small subset of the target domain. In the case of JDOT, we simply set α to the heuristic value α = 1/max_{i,j} d(x^s_i, x^t_j) as discussed previously, and f is estimated with kernel ridge regression.
Following [4], the accuracy is measured in the following way: a prediction is said to be correct if it falls within a range of three meters for the transfer across periods, and six meters for the transfer across devices. For each experiment, we randomly sample sixty percent of the source and target domains, and report in Table 3 the mean and standard deviation of the accuracies over ten repetitions. For transfer across periods, JDOT performs best in one out of three tasks. For transfer across devices, the superiority of JDOT is clearly established: it reaches average scores above 98%, at least ten points ahead of the best competing method on every task.
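The within-radius accuracy measure described above (a prediction counts as correct when it falls within a given distance of the true location) can be sketched as follows, assuming predicted and true positions are given directly in meters; the grid discretization of the hallway is abstracted away, and all names and values here are illustrative:

```python
import numpy as np

def localization_accuracy(pred, truth, radius):
    """Fraction of predictions within `radius` meters of the true position.

    `pred` and `truth` are (n, 2) arrays of 2-D positions (an assumption for
    illustration); the paper uses radius = 3 m for transfer across periods
    and 6 m for transfer across devices.
    """
    dist = np.linalg.norm(pred - truth, axis=1)
    return float((dist <= radius).mean())

# Illustrative positions only
truth = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 5.0]])
pred  = np.array([[1.0, 1.0], [10.0, 7.0], [22.0, 5.0]])
print(localization_accuracy(pred, truth, radius=3.0))
# → 0.6666666666666666 (2 of 3 predictions within 3 m)
```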
These very good results can be explained by the fact that optimal transport makes it possible to handle large distribution shifts that divergence-based criteria (such as the maximum mean discrepancy used in CTC) or reweighting strategies cannot cope with.

6 Discussion and conclusion

We have presented in this paper the Joint Distribution Optimal Transport for domain adaptation, a principled way of performing domain adaptation with optimal transport. JDOT assumes the existence of a transfer map that transforms a source domain joint distribution Ps(X, Y) into its target domain equivalent Pt(X, Y). Through this transformation, the alignment of both the feature space and the conditional distributions is achieved, allowing us to devise an efficient algorithm that simultaneously optimizes a coupling between Ps and Pt and a prediction function that solves the transfer problem. We also proved that learning with JDOT amounts to minimizing a bound on the target error. We have demonstrated through experiments on classical real-world benchmark datasets the superiority of our approach w.r.t. several state-of-the-art methods, including previous work on optimal transport based domain adaptation, domain adversarial neural networks and transfer components, on a variety of tasks including classification and regression. We have also shown the versatility of our method, which can accommodate several types of loss functions (mse, hinge) and classes of hypotheses (including kernel machines and neural networks). Potential follow-ups of this work include a semi-supervised extension (using unlabelled examples in the source domain) and investigating stochastic techniques for solving the adaptation efficiently.
From a theoretical standpoint, future work includes a deeper study of probabilistic transfer Lipschitzness and the development of guarantees able to take into account the complexity of the hypothesis class and the space of possible transport plans.

Acknowledgements

This work benefited from the support of the project OATMIL ANR-17-CE23-0012 of the French National Research Agency (ANR), the Normandie Projet GRR-DAISI, European funding FEDER DAISI and CNRS funding from the Défi Imag'In. The authors also wish to thank Kai Zhang and Qiaojun Wang for providing the Wifi localization dataset.

References

[1] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: an overview of recent advances. IEEE Signal Processing Magazine, 32(3), 2015.
[2] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[3] M. Sugiyama, S. Nakajima, H. Kashima, P.V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2008.
[4] K. Zhang, V. W. Zheng, Q. Wang, J. T. Kwok, Q. Yang, and I. Marsic. Covariate shift in Hilbert space: A solution via surrogate kernels. In ICML, 2013.
[5] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In ICML, volume 48, pages 2839–2848, 2016.
[6] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[7] M. Baktashmotlagh, M. Harandi, B. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In ICCV, pages 769–776, 2013.
[8] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In ICCV, 2013.
[9] M. Long, J. Wang, G. Ding, J. Sun, and P. Yu.
Transfer joint matching for unsupervised domain adaptation. In CVPR, pages 1410–1417, 2014.
[10] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, pages 1180–1189, 2015.
[11] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
[12] S. Si, D. Tao, and B. Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, July 2010.
[13] N. Courty, R. Flamary, and D. Tuia. Domain adaptation with regularized optimal transport. In ECML/PKDD, 2014.
[14] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[15] L. Kantorovich. On the translocation of masses. C.R. (Doklady) Acad. Sci. URSS (N.S.), 37:199–201, 1942.
[16] F. Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, 2015.
[17] M. Perrot, N. Courty, R. Flamary, and A. Habrard. Mapping estimation for discrete optimal transport. In NIPS, pages 4197–4205, 2016.
[18] C. Villani. Optimal transport: old and new. Grund. der mathematischen Wissenschaften. Springer, 2009.
[19] L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
[20] M. Thorpe, S. Park, S. Kolouri, G. Rohde, and D. Slepcev. A transportation Lp distance for signal analysis. CoRR, abs/1609.08669, 2016.
[21] R. Urner, S. Shalev-Shwartz, and S. Ben-David. Access to unlabeled data can speed up prediction time. In Proc. of ICML, pages 641–648, 2011.
[22] S. Ben-David, S. Shalev-Shwartz, and R. Urner.
Domain adaptation: can quantity compensate for quality? In Proc. of ISAIM, 2012.
[23] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Proc. of COLT, 2009.
[24] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
[25] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2013.
[26] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In NIPS, pages 3432–3440, 2016.
[27] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss–Seidel method under convex constraints. Operations Research Letters, 26(3):127–136, 2000.
[28] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In ECCV, LNCS, pages 213–226, 2010.
[29] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
[30] M. Long, J. Wang, G. Ding, S. Jialin Pan, and P.S. Yu. Adaptation regularization: A general framework for transfer learning. IEEE TKDE, 26(7):1076–1089, 2014.
[31] E. Zhong, W. Fan, Q. Yang, O. Verscheure, and J. Ren. Cross validation framework to choose amongst models and datasets for transfer learning. In ECML/PKDD, 2010.
[32] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proc. of EMNLP, pages 120–128, 2006.
[33] M. Chen, Z. Xu, K. Weinberger, and F. Sha. Marginalized denoising autoencoders for domain adaptation. In ICML, 2012.
[34] K. Zhang, M. Gong, and B. Schölkopf.
Multi-source domain adaptation: A causal view. In AAAI Conference on Artificial Intelligence, pages 3150–3157, 2015.