{"title": "Screening Sinkhorn Algorithm for Regularized Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 12169, "page_last": 12179, "abstract": "We introduce in this paper a novel strategy for efficiently approximating the Sinkhorn distance between two discrete measures. After identifying neglectable components of the dual solution of the regularized Sinkhorn problem, we propose to screen those components by directly setting them at that value before entering the Sinkhorn problem. This allows us to solve a smaller Sinkhorn problem while ensuring approximation with provable guarantees. More formally, the approach is based on a new formulation of dual of Sinkhorn divergence problem and on the KKT optimality conditions of this problem, which enable identification of dual components to be screened. This new analysis leads to the Screenkhorn algorithm. We illustrate the efficiency of Screenkhorn on complex tasks such as dimensionality reduction and domain adaptation involving regularized optimal transport.", "full_text": "Screening Sinkhorn Algorithm for Regularized\n\nOptimal Transport\n\nMokhtar Z. Alaya\n\nLITIS EA4108\n\nUniversity of Rouen Normandy\n\nmokhtarzahdi.alaya@gmail.com\n\nMaxime B\u00e9rar\nLITIS EA4108\n\nUniversity of Rouen Normandy\nmaxime.berar@univ-rouen.fr\n\nGilles Gasso\nLITIS EA4108\n\nINSA, University of Rouen Normandy\n\ngilles.gasso@insa-rouen.fr\n\nAlain Rakotomamonjy\n\nLITIS EA4108\n\nUniversity of Rouen Normandy\nand Criteo AI Lab, Criteo Paris\nalain.rakoto@insa-rouen.fr\n\nAbstract\n\nWe introduce in this paper a novel strategy for ef\ufb01ciently approximating the\nSinkhorn distance between two discrete measures. After identifying neglectable\ncomponents of the dual solution of the regularized Sinkhorn problem, we propose\nto screen those components by directly setting them at that value before entering\nthe Sinkhorn problem. This allows us to solve a smaller Sinkhorn problem while\nensuring approximation with provable guarantees. More formally, the approach\nis based on a new formulation of dual of Sinkhorn divergence problem and on\nthe KKT optimality conditions of this problem, which enable identi\ufb01cation of\ndual components to be screened. This new analysis leads to the SCREENKHORN\nalgorithm. We illustrate the ef\ufb01ciency of SCREENKHORN on complex tasks such\nas dimensionality reduction and domain adaptation involving regularized optimal\ntransport.\n\n1\n\nIntroduction\n\nComputing optimal transport (OT) distances between pairs of probability measures or histograms,\nsuch as the earth mover\u2019s distance [39, 34] and Monge-Kantorovich or Wasserstein distance [38],\nare currently generating an increasing attraction in different machine learning tasks [37, 28, 4, 22],\nstatistics [18, 32, 14, 6, 17], and computer vision [8, 34, 36], among other applications [27, 33]. In\nmany of these problems, OT exploits the geometric features of the objects at hand in the underlying\nspaces to be leveraged in comparing probability measures. This effectively leads to improved\nperformance of methods that are oblivious to the geometry, for example the chi-squared distances or\nthe Kullback-Leibler divergence. Unfortunately, this advantage comes at the price of an enormous\ncomputational cost of solving the OT problem, that can be prohibitive in large scale applications.\nFor instance, the OT between two histograms with supports of equal size n can be formulated as a\nlinear programming problem that requires generally super O(n2.5) [29] arithmetic operations, which\nis problematic when n becomes larger.\nA remedy to the heavy computation burden of OT lies in a prevalent approach referred to as regularized\nOT [11] and operates by adding an entropic regularization penalty to the original problem. Such a\nregularization guarantees a unique solution, since the objective function is strongly convex, and a\ngreater computational stability. More importantly, this regularized OT can be solved ef\ufb01ciently with\ncelebrated matrix scaling algorithms, such as Sinkhorn\u2019s \ufb01xed point iteration method [35, 26, 23].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fSeveral works have considered further improvements in the resolution of this regularized OT problem.\nA greedy version of Sinkhorn algorithm, called Greenkhorn [3], allows to select and update columns\nand rows that most violate the polytope constraints. Another approach based on low-rank approxima-\ntion of the cost matrix using the Nystr\u00f6m method induces the Nys-Sink algorithm [2]. Other classical\noptimization algorithms have been considered for approximating the OT, for instance accelerated\ngradient descent [40, 13, 30], quasi-Newton methods [7, 12] and stochastic gradient descent [20, 1].\nIn this paper, we propose a novel technique for accelerating the Sinkhorn algorithm when computing\nregularized OT distance between discrete measures. Our idea is strongly related to a screening\nstrategy when solving a Lasso problem in sparse supervised learning [21]. Based on the fact that\na transport plan resulting from an OT problem is sparse or presents a large number of neglectable\nvalues [7], our objective is to identify the dual variables of an approximate Sinkhorn problem, that are\nsmaller than a prede\ufb01ned threshold, and thus that can be safely removed before optimization while\nnot altering too much the solution of the problem. Within this global context, our contributions are\nthe following:\n\u2022 From a methodological point of view, we propose a new formulation of the dual of the Sinkhorn\ndivergence problem by imposing variables to be larger than a threshold. This formulation allows\nus to introduce suf\ufb01cient conditions, computable beforehand, for a variable to strictly satisfy its\nconstraint, leading then to a \u201cscreened\u201d version of the dual of Sinkhorn divergence.\n\u2022 We provide some theoretical analysis of the solution of the \u201cscreened\u201d Sinkhorn divergence,\nshowing that its objective value and the marginal constraint satisfaction are properly controlled as\nthe number of screened variables decreases.\n\u2022 From an algorithmic standpoint, we use a constrained L-BFGS-B algorithm [31, 9] but provide a\ncareful analysis of the lower and upper bounds of the dual variables, resulting in a well-posed and\nef\ufb01cient algorithm denoted as SCREENKHORN.\n\u2022 Our empirical analysis depicts how the approach behaves in a simple Sinkhorn divergence\ncomputation context. When considered in complex machine learning pipelines, we show that\nSCREENKHORN can lead to strong gain in ef\ufb01ciency while not compromising on accuracy.\n\nThe remainder of the paper is organized as follow. In Section 2 we brie\ufb02y review the basic setup\nof regularized discrete OT. Section 3 contains our main contribution, that is, the SCREENKHORN\nalgorithm. Section 4 is devoted to theoretical guarantees for marginal violations of SCREENKHORN.\nIn Section 5 we present numerical results for the proposed algorithm, compared with the state-of-art\nSinkhorn algorithm as implemented in [16]. The proofs of theoretical results are postponed to the\nsupplementary material as well as additional empirical results.\n\nNotation. For any positive matrix T \u2208 Rn\u00d7m, we de\ufb01ne its entropy as H(T ) = \u2212(cid:80)\n(cid:104)T, W(cid:105) = tr(T (cid:62)W ) =(cid:80)\n\ni,j Tij log(Tij).\nLet r(T ) = T 1m \u2208 Rn and c(T ) = T (cid:62)1n \u2208 Rm denote the rows and columns sums of T\nrespectively. The coordinates ri(T ) and cj(T ) denote the i-th row sum and the j-th column sum of\nT , respectively. The scalar product between two matrices denotes the usual inner product, that is\ni,j TijWij, where T (cid:62) is the transpose of T . We write 1 (resp. 0) the vector\nhaving all coordinates equal to one (resp. zero). \u2206(w) denotes the diag operator, such that if w \u2208 Rn,\nthen \u2206(w) = diag(w1, . . . , wn) \u2208 Rn\u00d7n. For a set of indices L = {i1, . . . , ik} \u2286 {1, . . . , n}\nsatisfying i1 < \u00b7\u00b7\u00b7 < ik, we denote the complementary set of L by L\n= {1, . . . , n}\\L. We also\n(cid:123)\ndenote |L| the cardinality of L. Given a vector w \u2208 Rn, we denote wL = (wi1, . . . , wik )(cid:62) \u2208 Rk\nand its complementary wL(cid:123) \u2208 Rn\u2212k. The notation is similar for matrices; given another subset\nof indices S = {j1, . . . , jl} \u2286 {1, . . . , m} with j1 < \u00b7\u00b7\u00b7 < jl, and a matrix T \u2208 Rn\u00d7m, we use\nT(L,S), to denote the submatrix of T , namely the rows and columns of T(L,S) are indexed by L and\nS respectively. When applied to matrices and vectors, (cid:12) and (cid:11) (Hadamard product and division)\nand exponential notations refer to elementwise operators. Given two real numbers a and b, we write\na \u2228 b = max(a, b) and a \u2227 b = min(a, b).\n\n2 Regularized discrete OT\n\n\u00b5 =(cid:80)n\n\nWe brie\ufb02y expose in this section the setup of OT between two discrete measures. We then consider\nthe case when those distributions are only available through a \ufb01nite number of samples, that is\nj=1 \u03bdi\u03b4yj \u2208 \u03a3m, where \u03a3n is the probability simplex with n bins,\n\ni=1 \u00b5i\u03b4xi \u2208 \u03a3n and \u03bd =(cid:80)m\n\n2\n\n\fnamely the set of probability vectors in Rn\nprobabilistic couplings set as \u03a0(\u00b5, \u03bd) = {P \u2208 Rn\u00d7m\n\n+, i.e., \u03a3n = {w \u2208 Rn\n\n, P 1m = \u00b5, P (cid:62)1n = \u03bd}.\n\ni=1 wi = 1}. We denote their\n\n+\n\n+ :(cid:80)n\n\nSinkhorn divergence. Computing OT distance between the two discrete measures \u00b5 and \u03bd amounts\nto solving a linear problem [25] given by\n\nS(\u00b5, \u03bd) = min\n\n(cid:104)C, P(cid:105),\n\nP\u2208\u03a0(\u00b5,\u03bd)\n\nwhere P = (Pij) \u2208 Rn\u00d7m is called the transportation plan, namely each entry Pij represents the\nfraction of mass moving from xi to yj, and C = (Cij) \u2208 Rn\u00d7m is a cost matrix comprised of\nnonnegative elements and related to the energy needed to move a probability mass from xi to yj. The\nentropic regularization of OT distances [11] relies on the addition of a penalty term as follows:\n\nS\u03b7(\u00b5, \u03bd) = min\n\nP\u2208\u03a0(\u00b5,\u03bd)\n\n{(cid:104)C, P(cid:105) \u2212 \u03b7H(P )},\n\n(1)\n\nwhere \u03b7 > 0 is a regularization parameter. We refer to S\u03b7(\u00b5, \u03bd) as the Sinkhorn divergence [11].\nDual of Sinkhorn divergence. Below we provide the derivation of the dual problem for the\nregularized OT problem (1). Towards this end, we begin with writing its Lagrangian dual function:\n\nexp(cid:0) \u2212 1\n\n\u03b7 (wi + zj + Cij) \u2212 1(cid:1), for all i = 1, . . . , n and j = 1, . . . , m. Plugging this solution, and\n\n(cid:62)1n \u2212 \u03bd(cid:105).\nL (P, w, z) = (cid:104)C, P(cid:105) + \u03b7(cid:104)log P, P(cid:105) + (cid:104)w, P 1m \u2212 \u00b5(cid:105) + (cid:104)z, P\nL (P, w, z). It is easy\nThe dual of Sinkhorn divergence can be derived by solving minP\u2208Rn\u00d7m\nto check that objective function P (cid:55)\u2192 L (P, w, z) is strongly convex and differentiable. Hence,\none can solve the latter minimum by setting \u2207P L (P, w, z) to 0n\u00d7m. Therefore, we get P (cid:63)\nij =\nsetting the change of variables u = \u2212w/\u03b7 \u2212 1/2 and v = \u2212z/\u03b7 \u2212 1/2, the dual problem is given by\n(2)\nwhere B(u, v) := \u2206(eu)K\u2206(ev) and K := e\u2212C/\u03b7 stands for the Gibbs kernel associated to the cost\nmatrix C. We refer to problem (2) as the dual of Sinkhorn divergence. Then, the optimal solution P (cid:63)\nof the primal problem (1) takes the form P (cid:63) = \u2206(eu(cid:63)\n) where the couple (u(cid:63), v(cid:63)) satis\ufb01es:\n(u(cid:63), v(cid:63)) = argmin\nu\u2208Rn,v\u2208Rm\n\nn B(u, v)1m \u2212 (cid:104)u, \u00b5(cid:105) \u2212 (cid:104)v, \u03bd(cid:105)(cid:9),\n\n(cid:8)\u03a8(u, v) := 1(cid:62)\n\n)K\u2206(ev(cid:63)\n{\u03a8(u, v)}.\n\nu\u2208Rn,v\u2208Rm\n\nmin\n\n+\n\nNote that the matrices \u2206(eu(cid:63)\n) are unique up to a constant factor [35]. Moreover, P (cid:63)\ncan be solved ef\ufb01ciently by iterative Bregman projections [5] referred to as Sinkhorn iterations, and\nthe method is referred to as SINKHORN algorithm which, recently, has been proven to achieve a\nnear-O(n2) complexity [3].\n\n) and \u2206(ev(cid:63)\n\n3 Screened dual of Sinkhorn divergence\n\nMotivation. The key idea of our approach is motivated by the\nso-called static screening test [21] in supervised learning, which is\na method able to safely identify inactive features, i.e., features that\nhave zero components in the solution vector. Then, these inactive\nfeatures can be removed from the optimization problem to reduce its\nscale. Before diving into detailed algorithmic analysis, let us present\na brief illustration of how we adapt static screening test to the dual\nof Sinkhorn divergence. Towards this end, we de\ufb01ne the convex set\nCr\n\u03b1 \u2286 Rr, for r \u2208 N and \u03b1 > 0, by Cr\n\u03b1 = {w \u2208 Rr : ewi \u2265 \u03b1}.\nIn Figure 1, we plot (eu(cid:63)\n) where (u(cid:63), v(cid:63)) is the pair solution\nof the dual of Sinkhorn divergence (2) in the particular case of:\nn = m = 500, \u03b7 = 1, \u00b5 = \u03bd = 1\nthe cost matrix C corresponds to the pairwise euclidean distance, i.e., Cij = (cid:107)xi \u2212 yj(cid:107)2. We also\nplot two lines corresponding to eu(cid:63) \u2261 \u03b1u and ev(cid:63) \u2261 \u03b1v for some \u03b1u > 0 and \u03b1v > 0, choosing\nrandomly and playing the role of thresholds to select indices to be discarded. If we are able to identify\nthese indices before solving the problem, they can be \ufb01xed at the thresholds and removed then from\nthe optimization procedure yielding an approximate solution.\n\nFigure 1: Plots of (eu(cid:63)\n, ev(cid:63)\n)\nwith (u(cid:63), v(cid:63)) is the pair solu-\ntion of dual of Sinkhorn diver-\ngence (2) and the thresholds\n\u03b1u, \u03b1v.\n\n0 1 )), yj \u223c N ((3, 3)(cid:62),(cid:0) 1 \u22120.8\n\nn 1n, xi \u223c N ((0, 0)(cid:62), ( 1 0\n\n(cid:1)) and\n\n, ev(cid:63)\n\n\u22120.8\n\n1\n\n3\n\n01002003004005000.00200.0025eu?ev?\u03b1u\u03b1v\fStatic screening test. Based on this idea, we de\ufb01ne a so-called approximate dual of Sinkhorn\ndivergence\n\nn B(u, v)1m \u2212 (cid:104)\u03bau, \u00b5(cid:105) \u2212 (cid:104) v\n\u03ba\n\n(3)\n\n(cid:8)\u03a8\u03ba(u, v) := 1(cid:62)\n\nmin\n\n,v\u2208Cm\n\n\u03b5\u03ba\n\nu\u2208Cn\n\n\u03b5\n\u03ba\n\n, \u03bd(cid:105)(cid:9),\n\nwhich is simply a dual of Sinkhorn divergence with lower-bounded variables, where the bounds are\n\u03b1u = \u03b5\u03ba\u22121 and \u03b1v = \u03b5\u03ba with \u03b5 > 0 and \u03ba > 0 being \ufb01xed numeric constants which values will be\nclear later. The new formulation (3) has the form of (\u03ba\u00b5, \u03bd/\u03ba)-scaling problem under constraints on\nthe variables u and v. Those constraints make the problem signi\ufb01cantly different from the standard\nscaling-problems [24]. We further emphasize that \u03ba plays a key role in our screening strategy. Indeed,\nwithout \u03ba, eu and ev can have inversely related scale that may lead in, for instance eu being too large\nand ev being too small, situation in which the screening test would apply only to coef\ufb01cients of eu or\nev and not for both of them. Moreover, it is clear that the approximate dual of Sinkhorn divergence\ncoincides with the dual of Sinkhorn divergence (2) when \u03b5 = 0 and \u03ba = 1. Intuitively, our hope is\nto gain ef\ufb01ciency in solving problem (3) compared to the original one in Equation (2) by avoiding\noptimization of variables smaller than the threshold and by identifying those that make the constraints\nactive. More formally, the core of the static screening test aims at locating two subsets of indices\n(I, J) in {1, . . . , n} \u00d7 {1, . . . , m} satisfying: eui > \u03b1u, and evj > \u03b1v, for all (i, j) \u2208 I \u00d7 J and\neui(cid:48) = \u03b1u, and evj(cid:48) = \u03b1v, for all (i(cid:48), j(cid:48)) \u2208 I\n\u03b1v. The following\nkey result states suf\ufb01cient conditions for identifying variables in I\nLemma 1. Let (u\u2217, v\u2217) be an optimal solution of problem (3). De\ufb01ne\n\n(cid:123), namely (u, v) \u2208 Cn\n(cid:123).\n(cid:123) and J\n\n(cid:123) \u00d7 J\n\n\u00d7 Cm\n\n\u03b1u\n\nI\u03b5,\u03ba =(cid:8)i = 1, . . . , n : \u00b5i \u2265 \u03b52\n\nri(K)(cid:9), J\u03b5,\u03ba =(cid:8)j = 1, . . . , m : \u03bdj \u2265 \u03ba\u03b52cj(K)(cid:9)\n\n(4)\n\n\u03ba\n\nj = \u03b5\u03ba for all i \u2208 I\n\ni = \u03b5\u03ba\u22121 and ev\u2217\n\nThen one has eu\u2217\nProof of Lemma 1 is postponed to the supplementary material. It is worth to note that \ufb01rst order\noptimality conditions applied to (u\u2217, v\u2217) ensure that if eu\u2217\n)i = \u03ba\u00b5i and if\nev\u2217\n)j = \u03ba\u22121\u03bdj, that correspond to the Sinkhorn marginal conditions [33] up\nto the scaling factor \u03ba.\n\ni > \u03b5\u03ba\u22121 then eu\u2217\n\nj > \u03b5\u03ba then ev\u2217\n\n\u03b5,\u03ba and j \u2208 J\n(cid:123)\n\nj (K(cid:62)eu\u2217\n\ni (Kev\u2217\n\n(cid:123)\n\u03b5,\u03ba.\n\n(cid:115)\n\nScreening with a \ufb01xed number budget of points. The approximate dual of Sinkhorn divergence\nis de\ufb01ned with respect to \u03b5 and \u03ba. As those parameters are dif\ufb01cult to interpret, we exhibit their\nrelations with a \ufb01xed number budget of points from the supports of \u00b5 and \u03bd. In the sequel, we denote\nby nb \u2208 {1, . . . , n} and mb \u2208 {1, . . . , m} the number of points that are going to be optimized in\nproblem (3), i.e, the points we cannot guarantee that eu\u2217\nLet us de\ufb01ne \u03be \u2208 Rn and \u03b6 \u2208 Rm to be the ordered decreasing vectors\nof \u00b5 (cid:11) r(K) and \u03bd (cid:11) c(K) respectively, that is \u03be1 \u2265 \u03be2 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03ben\nand \u03b61 \u2265 \u03b62 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03b6m. To keep only nb-budget and mb-budget of\npoints, the parameters \u03ba and \u03b5 satisfy \u03b52\u03ba\u22121 = \u03benb and \u03b52\u03ba = \u03b6mb.\nHence\n\ni = \u03b5\u03ba\u22121 and ev\u2217\n\nj = \u03b5\u03ba .\n\n.\n\n\u03b6mb\n\u03benb\n\n\u03b5 = (\u03benb \u03b6mb )1/4 and \u03ba =\n\n(5)\nThis guarantees that |I\u03b5,\u03ba| = nb and |J\u03b5,\u03ba| = mb by construction. In\naddition, (nb, mb) tends to the full number budget of points (n, m),\nwhen the couple parameters (\u03b5, \u03ba) converges to (0, 1). In Figure 2,\nwe plot these convergences, and hence the objective in problem (3)\nconverges to the objective of dual of Sinkhorn divergence (2).\nWe are now in position to formulate the optimization problem related to the screened dual of Sinkhorn.\ni \u2265 \u03b5\u03ba\u22121 and\nIndeed, using the above analyses, any solution (u\u2217, v\u2217) of problem (3) satis\ufb01es eu\u2217\nj \u2265 \u03b5\u03ba for all (i, j) \u2208 (I\u03b5,\u03ba \u00d7 J\u03b5,\u03ba), and eu\u2217\n(cid:123)\nev\u2217\n\u03b5,\u03ba).\nHence, we can restrict the problem (3) to variables in I\u03b5,\u03ba and J\u03b5,\u03ba. This boils down to restricting the\nconstraints feasibility Cn\n\nFigure 2: Plots of \u03b5 and \u03ba\nas a function of number bud-\nget of points for a screening\ntest with nb = mb and the\nparameters \u00b5, \u03bd, \u03b7, C are set\nas in Figure (1). (\u03b5, \u03ba) tends\nto (0, 1) as (nb, mb) tends to\n(n, m).\n\n\u03b5\u03ba to the screened domain de\ufb01ned by Usc \u2229 Vsc,\n\nj = \u03b5\u03ba for all (i, j) \u2208 (I\n\ni = \u03b5\u03ba\u22121 and ev\u2217\n\n\u03b5,\u03ba \u00d7 J\n(cid:123)\n\n\u2229 Cm\n\n\u03b5\n\u03ba\n\nUsc = {u \u2208 Rnb : euI\u03b5,\u03ba (cid:23) \u03b5\n\u03ba\n\n1nb} and Vsc = {v \u2208 Rmb : evJ\u03b5,\u03ba (cid:23) \u03b5\u03ba1mb}\n\n4\n\n2505000510\u03b52505000.00.51.0\u03ba\fwhere the vector comparison (cid:23) has to be understood elementwise. And, by replacing in Equation (3),\n(cid:123)\n\u03b5,\u03ba) by \u03b5\u03ba\u22121 and \u03b5\u03ba, we derive the screened dual of Sinkhorn\nthe variables belonging to (I\ndivergence problem as\n\n\u03b5,\u03ba \u00d7 J\n(cid:123)\n\nmin\n\nu\u2208Usc,v\u2208Vsc\n\n{\u03a8\u03b5,\u03ba(u, v)}\n\n(6)\n\n\u221211(cid:62)\n\nnb K(I(cid:123)\n\n\u03b5,\u03ba,J\u03b5,\u03ba)evJ\u03b5,\u03ba\n\nwhere\nK(I\u03b5,\u03ba,J\u03b5,\u03ba)evJ\u03b5,\u03ba + \u03b5\u03ba(euI\u03b5,\u03ba )(cid:62)\n\u03a8\u03b5,\u03ba(u, v) = (euI\u03b5,\u03ba )(cid:62)\nI\u03b5,\u03ba uI\u03b5,\u03ba \u2212 \u03ba\n\u2212 \u03ba\u00b5\n(cid:62)\n\n(cid:62)\nJ\u03b5,\u03bavJ\u03b5,\u03ba + \u039e\n\n\u22121\u03bd\n\nKij \u2212 \u03ba log(\u03b5\u03ba\u22121)(cid:80)\n\nwith \u039e = \u03b52(cid:80)\n\nK(I\u03b5,\u03ba,J(cid:123)\n\n\u03b5,\u03ba)1mb + \u03b5\u03ba\n\n\u00b5i \u2212 \u03ba\u22121 log(\u03b5\u03ba)(cid:80)\n\n\u03b5,\u03ba\n\n\u03b5,\u03ba\n\n\u03b5,\u03ba\n\ni\u2208I(cid:123)\n\ni\u2208I(cid:123)\n\nj\u2208J(cid:123)\n\n\u03b5,\u03ba,j\u2208J(cid:123)\n\n\u03bdj.\nThe above problem uses only the restricted parts K(I\u03b5,\u03ba,J\u03b5,\u03ba), K(I\u03b5,\u03ba,J(cid:123)\n\u03b5,\u03ba), and K(I(cid:123)\n\u03b5,\u03ba,J\u03b5,\u03ba) of the\nGibbs kernel K for calculating the objective function \u03a8\u03b5,\u03ba. Hence, a gradient descent scheme will\nalso need only those rows/columns of K. This is in contrast to Sinkhorn algorithm which performs\nalternating updates of all rows and columns of K. In summary, SCREENKHORN consists of two steps:\nthe \ufb01rst one is a screening pre-processing providing the active sets I\u03b5,\u03ba, J\u03b5,\u03ba. The second one consists\nin solving Equation (6) using a constrained L-BFGS-B [9] for the stacked variable \u03b8 = (uI\u03b5,\u03ba, vJ\u03b5,\u03ba).\nPseudocode of our proposed algorithm is shown in Algorithm 1. Note that in practice, we initialize\nthe L-BFGS-B algorithm based on the output of a method, called RESTRICTED SINKHORN (see\nAlgorithm 2 in the supplementary), which is a Sinkhorn-like algorithm applied to the active dual\nvariables \u03b8 = (uI\u03b5,\u03ba, vJ\u03b5,\u03ba). While simple and ef\ufb01cient, the solution of this RESTRICTED SINKHORN\nalgorithm does not satisfy the lower bound constraints of Problem (6) but provide a good candidate\nsolution. Also note that L-BFGS-B handles box constraints on variables, but it becomes more\nef\ufb01cient when these box bounds are carefully determined for problem (6). The following proposition\n(proof in supplementary material) expresses these bounds that are pre-calculated in the initialization\nstep of SCREENKHORN.\nProposition 1. Let (usc, vsc) be an optimal pair solution of problem (6) and Kmin =\nThen, one has\n\ni\u2208I\u03b5,\u03ba,j\u2208J\u03b5,\u03ba\n\nmin\n\nKij.\n\nand\n\n\u2228\n\n\u03b5\n\u03ba\n\n\u03b5\u03ba \u2228\n\nfor all i \u2208 I\u03b5,\u03ba and j \u2208 J\u03b5,\u03ba.\n\nmini\u2208I\u03b5,\u03ba \u00b5i\n\n\u03b5(m \u2212 mb) + \u03b5 \u2228 maxj\u2208J\u03b5,\u03ba \u03bdj\n\nn\u03b5\u03baKmin\n\n\u2264 eusc\n\ni \u2264 \u03b5\n\u03ba\n\n\u2228 maxi\u2208I\u03b5,\u03ba \u00b5i\nm\u03b5Kmin\n\n,\n\nmb\n\nminj\u2208J\u03b5,\u03ba \u03bdj\n\n\u03b5(n \u2212 nb) + \u03b5 \u2228 \u03ba maxi\u2208I\u03b5,\u03ba \u00b5i\n\nm\u03b5Kmin\n\n\u2264 evsc\n\nj \u2264 \u03b5\u03ba \u2228 maxj\u2208J\u03b5,\u03ba \u03bdj\n\nn\u03b5Kmin\n\nnb\n\n(7)\n\n(8)\n\n4 Theoretical analysis and guarantees\n\nThis section is devoted to establishing theoretical guarantees for SCREENKHORN algorithm. We \ufb01rst\nde\ufb01ne the screened marginals \u00b5sc = B(usc, vsc)1m and \u03bdsc = B(usc, vsc)(cid:62)1n. Our \ufb01rst theoretical\nresult, Proposition 2, gives an upper bound of the screened marginal violations with respect to\n(cid:96)1-norm.\nProposition 2. Let (usc, vsc) be an optimal pair solution of problem (6). Then one has\n\n(cid:107)\u00b5 \u2212 \u00b5sc(cid:107)2\n\nnbc\u03ba + (n \u2212 nb)\n\n+\n\nmb\n\n\u221a\nnmc\u00b5\u03bd K 3/2\nmin\n\n+\n\nm \u2212 mb\n\u221a\nnmKmin\n\n+ log\n\nnm\nmbc5/2\n\u00b5\u03bd\n\n(9)\n\nand\n(cid:107)\u03bd \u2212 \u03bdsc(cid:107)2\n, (10)\nwhere cz = z \u2212 log z \u2212 1 for z > 0 and c\u00b5\u03bd = \u00b5 \u2227 \u03bd with \u00b5 = mini\u2208I\u03b5,\u03ba \u00b5i and \u03bd = minj\u2208J\u03b5,\u03ba \u03bdj.\n\n\u221a\nnmc\u00b5\u03bd K 3/2\nmin\n\nn \u2212 nb\n\u221a\nnmKmin\n\n+ (m \u2212 mb)\n\nnm\nnbc5/2\n\u00b5\u03bd\n\nmbc 1\n\u03ba\n\n+ log\n\nnb\n\n+\n\n+\n\n\u03b7\n\n(cid:16) \u221a\n(cid:16) \u221a\n\n(cid:17)(cid:17)(cid:17)\n(cid:17)(cid:17)(cid:17)\n\n1 = O(cid:16)\n1 = O(cid:16)\n\n(cid:16)(cid:107)C(cid:107)\u221e\n(cid:16)(cid:107)C(cid:107)\u221e\n\n\u03b7\n\n5\n\n\fAlgorithm 1: SCREENKHORN(C, \u03b7, \u00b5, \u03bd, nb, mb)\nStep 1: Screening pre-processing\n\n1. \u03be \u2190 sort(\u00b5 (cid:11) r(K)), \u03b6 \u2190 sort(\u03bd (cid:11) c(K)); //(decreasing order)\n3. I\u03b5,\u03ba \u2190 {i = 1, . . . , n : \u00b5i \u2265 \u03b52\u03ba\u22121ri(K)}, J\u03b5,\u03ba \u2190 {j = 1, . . . , m : \u03bdj \u2265 \u03b52\u03bacj(K)};\n4. \u00b5 \u2190 mini\u2208I\u03b5,\u03ba \u00b5i, \u00af\u00b5 \u2190 maxi\u2208I\u03b5,\u03ba \u00b5i, \u03bd \u2190 minj\u2208J\u03b5,\u03ba \u03bdi, \u00af\u03bd \u2190 maxj\u2208J\u03b5,\u03ba \u03bdi;\n\n\u03b6mb /\u03benb;\n\n2. \u03b5 \u2190 (\u03benb \u03b6mb )1/4, \u03ba \u2190(cid:112)\n5. u \u2190 log(cid:0) \u03b5\n6. v \u2190 log(cid:0)\u03b5\u03ba \u2228\n\n\u03ba \u2228\n\n\u00b5\n\u03b5(m\u2212mb)+\u03b5\u2228\n\u03bd\n\u03b5(n\u2212nb)+\u03b5\u2228 \u03ba \u00af\u00b5\n\n\u00af\u03bd\n\nn\u03b5\u03baKmin\n\nmb\n\nnb\n\n(cid:1), \u00afu \u2190 log(cid:0) \u03b5\n(cid:1), \u00afv \u2190 log(cid:0)\u03b5\u03ba \u2228\n\n\u03ba \u2228\n\n(cid:1);\n(cid:1);\n\n\u00af\u00b5\n\nm\u03b5Kmin\n\n\u00af\u03bd\n\nn\u03b5Kmin\n\nm\u03b5Kmin\n\n7. \u00af\u03b8 \u2190 stack(\u00afu1nb , \u00afv1mb ), \u03b8 \u2190 stack(u1nb , v1mb );\nStep 2: L-BFGS-B solver on the screened variables\n\n8. u(0) \u2190 log(\u03b5\u03ba\u22121)1nb , v(0) \u2190 log(\u03b5\u03ba)1mb;\n9. \u02c6u, \u02c6v \u2190 RESTRICTED SINKHORN(u(0), u(0));\n10. \u03b8(0) \u2190 stack(\u02c6u, \u02c6v);\n11. \u03b8 \u2190 L-BFGS-B(\u03b8(0), \u03b8, \u00af\u03b8);\n12. \u03b8u \u2190 (\u03b81, . . . , \u03b8nb )(cid:62), \u03b8v \u2190 (\u03b8nb+1, . . . , \u03b8nb+mb )(cid:62);\ni \u2190 (\u03b8u)i if i \u2208 I\u03b5,\u03ba and ui \u2190 log(\u03b5\u03ba\u22121) if i \u2208 I\n13. usc\nj \u2190 (\u03b8v)j if j \u2208 J\u03b5,\u03ba and vj \u2190 log(\u03b5\u03ba) if j \u2208 J\n(cid:123)\n14. vsc\n\u03b5,\u03ba;\n15. return B(usc, vsc).\n\n(cid:123)\n\u03b5,\u03ba;\n\n\u03ba\n\nProof of Proposition 2 is presented in supplementary material and it is based on \ufb01rst order optimality\nconditions for problem (6) and on a generalization of Pinsker inequality (see Lemma 2 in supplemen-\ntary). Note that c\u03ba and c 1\ntend to zeros as \u03ba goes to one, which is the case when the number budget\nof points (nb, mb) tends to the full one (n, m).\nOur second theoretical result, Proposition 3, is an upper bound of the difference between objective\nvalues of SCREENKHORN and dual of Sinkhorn divergence (2).\nProposition 3. Let (usc, vsc) be an optimal pair solution of problem (6) and (u(cid:63), v(cid:63)) is the pair\nsolution of dual of Sinkhorn divergence (2). Then we have\n\n\u03a8\u03b5,\u03ba(usc, vsc) \u2212 \u03a8(u(cid:63), v(cid:63)) = O(cid:0)R((cid:107)\u00b5 \u2212 \u00b5sc(cid:107)1 + (cid:107)\u03bd \u2212 \u03bdsc(cid:107)1 + \u03c9\u03ba)(cid:1).\n\u03b7 + log(cid:0) (n\u2228m)2\n\n(cid:1) and \u03c9\u03ba = (1 \u2212 \u03ba)(cid:107)\u00b5sc(cid:107)1 + (\u03ba\u22121 \u2212 1)(cid:107)\u03bdsc(cid:107)1 + \u03ba\u22121 \u2212 \u03ba.\n\nwhere R =\n\n(cid:107)C(cid:107)\u221e\n\nnmc7/2\n\u00b5\u03bd\n\nProof of Proposition 3 is exposed in the supplementary material. Comparing to some other analysis\nresults of this quantity, see for instance Lemma 2 in [13] and Lemma 3.1 in [30], our bound involves\nan additional term \u03c9\u03ba (with \u03c91 = 0), that tends to zero as the pair budget (nb, mb) goes to the\nfull number budget of points (n, m) (i.e., \u03ba goes to 1). To better characterize \u03c9\u03ba, a control of the\n(cid:96)1-norms of the screened marginals \u00b5sc and \u03bdsc are given in Lemma 3 in the supplementary material.\n\n5 Numerical experiments\n\nIn this section, we present some numerical analyses of our SCREENKHORN algorithm and show how\nit behaves when integrated into some complex machine learning pipelines.\n\n5.1 Setup\n\nWe have implemented our SCREENKHORN algorithm in Python and used the L-BFGS-B of Scipy. Re-\ngarding the machine-learning based comparison, we have based our code on the ones of Python Opti-\nmal Transport toolbox (POT) [16] and just replaced the sinkhorn function call with a screenkhorn\none. We have considered the POT\u2019s default SINKHORN stopping criterion parameters and for\nSCREENKHORN, the L-BFGS-B algorithm is stopped when the largest component of the projected\ngradient is smaller than 10\u22126, when the number of iterations or the number of objective function\nevaluations reach 105. For all applications, we have set \u03b7 = 1 unless otherwise speci\ufb01ed.\n\n6\n\n\fFigure 3: Empirical evaluation of SCREENKHORN vs SINKHORN for normalized cost matrix i.e.\n(cid:107)C(cid:107)\u221e = 1. (most-lefts): marginal violations in relation with the budget of points on n and m .\n(center-right) ratio of computation times TSINKHORN\nand, (right) relative divergence variation. The\nTSCREENKHORN\nresults are averaged over 30 trials.\n\n5.2 Analysing on toy problem\n\n)K\u2206(ev(cid:63)\n\nTSINKHORN\n\nTSCREENKHORN\n\n) and P sc = \u2206(eusc\n\nWe compare SCREENKHORN to SINKHORN as implemented in POT toolbox1 on a synthetic example.\nThe dataset we use consists of source samples generated from a bi-dimensional gaussian mixture\nand target samples following the same distribution but with different gaussian means. We consider\nan unsupervised domain adaptation using optimal transport with entropic regularization. Several\nsettings are explored: different values of \u03b7, the regularization parameter, the allowed budget nb\nn = mb\nm\nranging from 0.01 to 0.99, different values of n and m. We empirically measure marginal violations\nas the norms (cid:107)\u00b5 \u2212 \u00b5sc(cid:107)1 and (cid:107)\u03bd \u2212 \u03bdsc(cid:107)1, running time expressed as\nand the relative\ndivergence difference |(cid:104)C, P (cid:63)(cid:105)\u2212(cid:104)C, P sc(cid:105)|/(cid:104)C, P (cid:63)(cid:105) between SCREENKHORN and SINKHORN, where\nP (cid:63) = \u2206(eu(cid:63)\n). Figure 3 summarizes the observed behaviors\nof both algorithms under these settings. We choose to only report results for n = m = 1000 as we\nget similar \ufb01ndings for other values of n and m.\nSCREENKHORN provides good approximation of the marginals \u00b5 and \u03bd for \u201chigh\u201d values of the\nregularization parameter \u03b7 (\u03b7 > 1). The approximation quality diminishes for small \u03b7. As expected\n(cid:107)\u00b5 \u2212 \u00b5sc(cid:107)1 and (cid:107)\u03bd \u2212 \u03bdsc(cid:107)1 converge towards zero when increasing the budget of points. Remarkably\nmarginal violations are almost negligible whatever the budget for high \u03b7. According to computation\ngain, SCREENKHORN is almost 2 times faster than SINKHORN at high decimation factor n/nb (low\nbudget) while the reverse holds when n/nb gets close to 1. Computational bene\ufb01t of SCREENKHORN\nalso depends on \u03b7 with appropriate values \u03b7 \u2264 1. Finally except for \u03b7 = 0.1 SCREENKHORN achieves\na divergence (cid:104)C, P(cid:105) close to the one of Sinkhorn showing that our static screening test provides a\nreasonable approximation of the Sinkhorn divergence. As such, we believe that SCREENKHORN will\nbe practically useful in cases where modest accuracy on the divergence is suf\ufb01cient. This may be the\ncase of a loss function for a gradient descent method (see next section).\n\n)K\u2206(evsc\n\n5.3\n\nIntegrating SCREENKHORN into machine learning pipelines\n\nHere, we analyse the impact of using SCREENKHORN instead of SINKHORN in a complex machine\nlearning pipeline. Our two applications are a dimensionality reduction technique, denoted as Wasser-\nstein Discriminant Analysis (WDA), based on Wasserstein distance approximated through Sinkhorn\ndivergence [17] and a domain-adaptation using optimal transport mapping [10], named OTDA.\nWDA aims at \ufb01nding a linear projection which minimize the ratio of distance between intra-class\nsamples and distance inter-class samples, where the distance is understood in a Sinkhorn divergence\nsense. We have used a toy problem involving Gaussian classes with 2 discriminative features\nand 8 noisy features and the MNIST dataset. For the former problem, we aim at \ufb01nd the best\ntwo-dimensional linear subspace in a WDA sense whereas for MNIST, we look for a subspace\nof dimension 20 starting from the original 728 dimensions. Quality of the retrieved subspace are\nevaluated using classi\ufb01cation task based on a 1-nearest neighbour approach.\nFigure 4 presents the average gain (over 30 trials) in computational time we get as the number of\nexamples evolve and for different decimation factors of the SCREENKHORN problem. Analysis of\n\n1https://pot.readthedocs.io/en/stable/index.html\n\n7\n\n1.11.2525102050100Decimation factor n/nb103102101100||sc||1n=m=1000=0.1=0.5=1=101.11.2525102050100Decimation factor mb/m103102101100||sc||1n=m=1000=0.1=0.5=1=101.11.2525102050100Decimation factor n/nb0.51.01.52.02.5Running Time Gainn=m=1000=0.1=0.5=1=101.11.2525102050100Decimation factor n/nb103102101100Relative Divergence Variationn=m=1000=0.1=0.5=1=10\fFigure 4: Wasserstein Discriminant Analysis : running time gain for (left) a toy dataset and (right)\nMNIST as a function of the number of examples and the data decimation factor in SCREENKHORN.\n\nFigure 5: OT Domain adaptation : running time gain for MNIST as a function of the number of\nexamples and the data decimation factor in SCREENKHORN. Group-lasso hyperparameter values\n(left) 1. (right) 10.\n\nthe quality of the subspace have been deported to the supplementary material (see Figure 7), but we\ncan remark a small loss of performance of SCREENKHORN for the toy problem, while for MNIST,\naccuracies are equivalent regardless of the decimation factor. We can note that the minimal gains are\nrespectively 2 and 4.5 for the toy and MNIST problem whereas the maximal gain for 4000 samples\nis slightly larger than an order of magnitude.\nFor the OT based domain adaptation problem, we have considered the OTDA with (cid:96) 1\n2 ,1 group-lasso\nregularizer that helps in exploiting available labels in the source domain. The problem is solved using\na majorization-minimization approach for handling the non-convexity of the problem. Hence, at each\niteration, a SINKHORN/SCREENKHORN has to be computed and the number of iteration is sensitive to\nthe regularizer strength. As a domain-adaptation problem, we have used a MNIST to USPS problem\nin which features have been computed from the \ufb01rst layers of a domain adversarial neural networks\n[19] before full convergence of the networks (so as to leave room for OT adaptation). Figure 5 reports\nthe gain in running time for 2 different values of the group-lasso regularizer hyperparameter, while\nthe curves of performances are reported in the supplementary material. We can note that for all the\nSCREENKHORN with different decimation factors, the gain in computation goes from a factor of 4 to\n12, without any loss of the accuracy performance.\n\n6 Conclusion\n\nThe paper introduces a novel ef\ufb01cient approximation of the Sinkhorn divergence based on a screening\nstrategy. Screening some of the Sinkhorn dual variables has been made possible by de\ufb01ning a novel\nconstrained dual problem and by carefully analyzing its optimality conditions. From the latter, we\nderived some suf\ufb01cient conditions depending on the ground cost matrix, that some dual variables are\nsmaller than a given threshold. Hence, we need just to solve a restricted dual Sinkhorn problem using\nan off-the-shelf L-BFGS-B algorithm. We also provide some theoretical guarantees of the quality\nof the approximation with respect to the number of variables that have been screened. Numerical\nexperiments show the behaviour of our SCREENKHORN algorithm and computational time gain it\ncan achieve when integrated in some complex machine learning pipelines.\n\n8\n\n010002000300040005000Number of samples24681012Running Time GainScreened WDA on toydec=1.5dec=2dec=5dec=10dec=20dec=50dec=1005001000150020002500300035004000Number of samples510152025Running Time GainScreened WDA on mnistdec=1.5dec=2dec=5dec=10dec=20dec=50dec=1005001000150020002500300035004000Number of samples46810Running Time GainScreened OTDA on mnistdec=1.5dec=2dec=5dec=10dec=20dec=50dec=1005001000150020002500300035004000Number of samples4681012Running Time GainScreened OTDA on mnistdec=1.5dec=2dec=5dec=10dec=20dec=50dec=100\fAcknowledgments\n\nThis work was supported by grants from the Normandie Projet GRR-DAISI, European funding\nFEDER DAISI and OATMIL ANR-17-CE23-0012 Project of the French National Research Agency\n(ANR).\n\nReferences\n[1] B. K. Abid and R. Gower. Stochastic algorithms for entropy-regularized optimal transport problems. In\nAmos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference\non Arti\ufb01cial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages\n1505\u20131512, Playa Blanca, Lanzarote, Canary Islands, 2018. PMLR.\n\n[2] J. Altschuler, F. Bach, A. Rudi, and J. Weed. Massively scalable Sinkhorn distances via the Nystr\u00f6m\n\nmethod, 2018.\n\n[3] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via\nSinkhorn iteration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and\nR. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1964\u20131974. Curran\nAssociates, Inc., 2017.\n\n[4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Doina Precup and\nYee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70\nof Proceedings of Machine Learning Research, pages 214\u2013223, International Convention Centre, Sydney,\nAustralia, 2017. PMLR.\n\n[5] J. D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyr\u00e9. Iterative bregman projections for regularized\n\ntransportation problems. SIAM J. Scienti\ufb01c Computing, 37, 2015.\n\n[6] J. Bigot, R. Gouet, T. Klein, and A. L\u00f3pez. Geodesic PCA in the Wasserstein space by convex PCA. Ann.\n\nInst. H. Poincar\u00e9 Probab. Statist., 53(1):1\u201326, 2017.\n\n[7] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In Amos Storkey and Fernando\nPerez-Cruz, editors, Proceedings of the Twenty-First International Conference on Arti\ufb01cial Intelligence\nand Statistics, volume 84 of Proceedings of Machine Learning Research, pages 880\u2013889, Playa Blanca,\nLanzarote, Canary Islands, 2018. PMLR.\n\n[8] N. Bonneel, M. van de Panne, S. Paris, and W. Heidrich. Displacement interpolation using Lagrangian\n\nmass transport. ACM Trans. Graph., 30(6):158:1\u2013158:12, 2011.\n\n[9] R. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization.\n\nSIAM Journal on Scienti\ufb01c Computing, 16(5):1190\u20131208, 1995.\n\n[10] Nicolas Courty, R\u00e9mi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain\n\nadaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853\u20131865, 2017.\n\n[11] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In C. J. C. Burges, L. Bottou,\nM. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing\nSystems 26, pages 2292\u20132300. Curran Associates, Inc., 2013.\n\n[12] M. Cuturi and G. Peyr\u00e9. A smoothed dual approach for variational Wasserstein problems. SIAM Journal\n\non Imaging Sciences, 9(1):320\u2013343, 2016.\n\n[13] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by\naccelerated gradient descent is better than by Sinkhorn\u2019s algorithm. In Jennifer Dy and Andreas Krause,\neditors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings\nof Machine Learning Research, pages 1367\u20131376, Stockholmsm\u00e4ssan, Stockholm Sweden, 2018. PMLR.\n\n[14] J. Ebert, V. Spokoiny, and A. Suvorikova. Construction of non-asymptotic con\ufb01dence sets in 2-Wasserstein\n\nspace, 2017.\n\n[15] Y. Fei, G. Rong, B. Wang, and W. Wang. Parallel L-BFGS-B algorithm on GPU. Computers & Graphics,\n\n40:1 \u2013 9, 2014.\n\n[16] R. Flamary and N. Courty. POT: Python optimal transport library, 2017.\n\n[17] R. Flamary, M. Cuturi, N. Courty, and A. Rakotomamonjy. Wasserstein discriminant analysis. Machine\n\nLearning, 107(12):1923\u20131945, 2018.\n\n9\n\n\f[18] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss.\nIn C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural\nInformation Processing Systems 28, pages 2053\u20132061. Curran Associates, Inc., 2015.\n\n[19] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky.\nDomain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096\u2013\n2030, 2016.\n\n[20] A. Genevay, M. Cuturi, G. Peyr\u00e9, and F. Bach. Stochastic optimization for large-scale optimal transport. In\nD. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information\nProcessing Systems 29, pages 3440\u20133448. Curran Associates, Inc., 2016.\n\n[21] L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. CoRR,\n\nabs/1009.4219, 2010.\n\n[22] N. Ho, X. L. Nguyen, M. Yurochkin, H. H. Bui, V. Huynh, and D. Phung. Multilevel clustering via\nWasserstein means. In Proceedings of the 34th International Conference on Machine Learning - Volume\n70, ICML\u201917, pages 1501\u20131509. JMLR.org, 2017.\n\n[23] B. Kalantari, I. Lari, F. Ricca, and B. Simeone. On the complexity of general matrix scaling and entropy\n\nminimization via the ras algorithm. Mathematical Programming, 112(2):371\u2013401, 2008.\n\n[24] B. Kalantari and L.Khachiyan. On the complexity of nonnegative-matrix scaling. Linear Algebra and its\n\nApplications, 240:87 \u2013 103, 1996.\n\n[25] L. Kantorovich. On the transfer of masses (in russian). Doklady Akademii Nauk, 2:227\u2013229, 1942.\n\n[26] P. Knight. The Sinkhorn\u2013Knopp algorithm: Convergence and applications. SIAM Journal on Matrix\n\nAnalysis and Applications, 30(1):261\u2013275, 2008.\n\n[27] S. Kolouri, S. R. Park, M. Thorpe, D. Slepcev, and G. K. Rohde. Optimal mass transport: Signal processing\n\nand machine-learning applications. IEEE Signal Processing Magazine, 34(4):43\u201359, 2017.\n\n[28] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In\nFrancis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine\nLearning, volume 37 of Proceedings of Machine Learning Research, pages 957\u2013966, Lille, France, 2015.\nPMLR.\n\n[29] Y. T. Lee and A. Sidford. Path \ufb01nding methods for linear programming: Solving linear programs in\n\u00d5(vrank) iterations and faster algorithms for maximum \ufb02ow. In Proceedings of the 2014 IEEE 55th Annual\nSymposium on Foundations of Computer Science, FOCS \u201914, pages 424\u2013433, Washington, DC, USA, 2014.\nIEEE Computer Society.\n\n[30] T. Lin, N. Ho, and M. I. Jordan. On ef\ufb01cient optimal transport: An analysis of greedy and accelerated\n\nmirror descent algorithms. CoRR, abs/1901.06482, 2019.\n\n[31] J. Nocedal. Updating quasi-newton matrices with limited storage. Mathematics of Computation,\n\n35(151):773\u2013782, 1980.\n\n[32] V. M. Panaretos and Y. Zemel. Amplitude and phase variation of point processes. Ann. Statist., 44(2):771\u2013\n\n812, 2016.\n\n[33] G. Peyr\u00e9 and M. Cuturi. Computational optimal transport. Foundations and Trends R(cid:13) in Machine Learning,\n\n11(5-6):355\u2013607, 2019.\n\n[34] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover\u2019s distance as a metric for image retrieval.\n\nInternational Journal of Computer Vision, 40(2):99\u2013121, 2000.\n\n[35] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American\n\nMathematical Monthly, 74(4):402\u2013405, 1967.\n\n[36] J. Solomon, F. de Goes, G. Peyr\u00e9, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional\nWasserstein distances: Ef\ufb01cient optimal transportation on geometric domains. ACM Trans. Graph.,\n34(4):66:1\u201366:11, 2015.\n\n[37] J. Solomon, R. Rustamov, L. Guibas, and A. Butscher. Wasserstein propagation for semi-supervised\nlearning. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on\nMachine Learning, volume 32 of Proceedings of Machine Learning Research, pages 306\u2013314, Bejing,\nChina, 2014. PMLR.\n\n10\n\n\f[38] C. Villani. Optimal Transport: Old and New, volume 338 of Grundlehren der mathematischen Wis-\n\nsenschaften. Springer Berlin Heidelberg, 2009.\n\n[39] M. Werman, S. Peleg, and A. Rosenfeld. A distance metric for multidimensional histograms. Computer\n\nVision, Graphics, and Image Processing, 32(3):328 \u2013 336, 1985.\n\n[40] Y. Xie, X.Wang, R. Wang, and H. Zha. A fast proximal point method for computing Wasserstein distance,\n\n2018.\n\n11\n\n\f", "award": [], "sourceid": 6597, "authors": [{"given_name": "Mokhtar Z.", "family_name": "Alaya", "institution": "LITIS Lab, University of Rouen"}, {"given_name": "Maxime", "family_name": "Berar", "institution": "Universit\u00e9 de Rouen"}, {"given_name": "Gilles", "family_name": "Gasso", "institution": "LITIS - INSA de Rouen"}, {"given_name": "Alain", "family_name": "Rakotomamonjy", "institution": "Universit\u00e9 de Rouen Normandie   Criteo AI Lab"}]}