{"title": "Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 1964, "page_last": 1974, "abstract": "Computing optimal transport distances such as the earth mover's distance is a fundamental problem in machine learning, statistics, and computer vision. Despite the recent introduction of several algorithms with good empirical performance, it is unknown whether general optimal transport distances can be approximated in near-linear time. This paper demonstrates that this ambitious goal is in fact achieved by Cuturi's Sinkhorn Distances. This result relies on a new analysis of Sinkhorn iterations, which also directly suggests a new greedy coordinate descent algorithm Greenkhorn with the same theoretical guarantees. Numerical simulations illustrate that Greenkhorn significantly outperforms the classical Sinkhorn algorithm in practice.", "full_text": "Near-linear time approximation algorithms for\n\noptimal transport via Sinkhorn iteration\n\nJason Altschuler\n\nMIT\n\njasonalt@mit.edu\n\nJonathan Weed\n\nMIT\n\njweed@mit.edu\n\nPhilippe Rigollet\n\nMIT\n\nrigollet@mit.edu\n\nAbstract\n\nComputing optimal transport distances such as the earth mover\u2019s distance is a\nfundamental problem in machine learning, statistics, and computer vision. Despite\nthe recent introduction of several algorithms with good empirical performance,\nit is unknown whether general optimal transport distances can be approximated\nin near-linear time. This paper demonstrates that this ambitious goal is in fact\nachieved by Cuturi\u2019s Sinkhorn Distances. This result relies on a new analysis\nof Sinkhorn iterations, which also directly suggests a new greedy coordinate\ndescent algorithm GREENKHORN with the same theoretical guarantees. 
Numerical\nsimulations illustrate that GREENKHORN signi\ufb01cantly outperforms the classical\nSINKHORN algorithm in practice.\n\nDedicated to the memory of Michael B. Cohen\n\n1\n\nIntroduction\n\nComputing distances between probability measures on metric spaces, or more generally between point\nclouds, plays an increasingly preponderant role in machine learning [SL11, MJ15, LG15, JSCG16,\nACB17], statistics [FCCR16, PZ16, SR04, BGKL17] and computer vision [RTG00, BvdPPH11,\nSdGP+15]. A prominent example of such distances is the earth mover\u2019s distance introduced\nin [WPR85] (see also [RTG00]), which is a special case of Wasserstein distance, or optimal transport\n(OT) distance [Vil09].\nWhile OT distances exhibit a unique ability to capture geometric features of the objects at hand, they\nsuffer from a heavy computational cost that had been prohibitive in large scale applications until the\nrecent introduction to the machine learning community of Sinkhorn Distances by Cuturi [Cut13].\nCombined with other numerical tricks, these recent advances have enabled the treatment of large\npoint clouds in computer graphics such as triangle meshes [SdGP+15] and high-resolution neu-\nroimaging data [GPC15]. Sinkhorn Distances rely on the idea of entropic penalization, which has\nbeen implemented in similar problems at least since Schr\u00f6dinger [Sch31, Leo14]. This powerful\nidea has been successfully applied to a variety of contexts not only as a statistical tool for model\nselection [JRT08, RT11, RT12] and online learning [CBL06], but also as an optimization gadget in\n\ufb01rst-order optimization methods such as mirror descent and proximal methods [Bub15].\n\nRelated work. 
Computing an OT distance amounts to solving the following linear program:

    min_{P ∈ U_{r,c}} ⟨P, C⟩ ,    U_{r,c} := { P ∈ ℝ^{n×n}_+ : P1 = r , Pᵀ1 = c } ,    (1)

where 1 is the all-ones vector in ℝⁿ, C ∈ ℝ^{n×n}_+ is a given cost matrix, and r ∈ ℝⁿ, c ∈ ℝⁿ are given vectors with positive entries that sum to one. Typically C is a matrix containing pairwise

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

distances (and is thus dense), but in this paper we allow C to be an arbitrary non-negative dense matrix with bounded entries, since our results are more general. For brevity, this paper focuses on square matrices C and P, since extensions to the rectangular case are straightforward.

This paper is at the intersection of two lines of research: a theoretical one that aims at finding (near) linear time approximation algorithms for simple problems that are already known to run in polynomial time, and a practical one that pursues fast algorithms for solving optimal transport approximately for large datasets.

Noticing that (1) is a linear program with O(n) linear constraints and certain graphical structure, one can use the recent Lee-Sidford linear solver to find a solution in time Õ(n^2.5) [LS14], improving over the previous standard of O(n^3.5) [Ren88]. While no practical implementation of the Lee-Sidford algorithm is known, it provides a theoretical benchmark for our methods. Their result is part of a long line of work initiated by the seminal paper of Spielman and Teng [ST04] on solving linear systems of equations, which has provided a building block for near-linear time approximation algorithms in a variety of combinatorially structured linear problems. A separate line of work has focused on obtaining faster algorithms for (1) by imposing additional assumptions. 
For instance, [AS14] obtain\napproximations to (1) when the cost matrix C arises from a metric, but their running times are not\ntruly near-linear. [SA12,ANOY14] develop even faster algorithms for (1), but require C to arise from\na low-dimensional (cid:96)p metric.\nPractical algorithms for computing OT distances include Orlin\u2019s algorithm for the Uncapacitated\nMinimum Cost Flow problem via a standard reduction. Like interior point methods, it has a provable\ncomplexity of O(n3 log n). This dependence on the dimension is also observed in practice, thereby\npreventing large-scale applications. To overcome the limitations of such general solvers, various\nideas ranging from graph sparsi\ufb01cation [PW09] to metric embedding [IT03, GD04, SJ08] have been\nproposed over the years to deal with particular cases of OT distance.\nOur work complements both lines of work, theoretical and practical, by providing the \ufb01rst near-linear\ntime guarantee to approximate (1) for general non-negative cost matrices. Moreover we show that\nthis performance is achieved by algorithms that are also very ef\ufb01cient in practice. Central to our\ncontribution are recent developments of scalable methods for general OT that leverage the idea of\nentropic regularization [Cut13, BCC+15, GCPB16]. However, the apparent practical ef\ufb01cacy of these\napproaches came without theoretical guarantees. In particular, showing that this regularization yields\nan algorithm to compute or approximate general OT distances in time nearly linear in the input size\nn2 was an open question before this work.\n\nOur contribution. The contribution of this paper is twofold. First we demonstrate that, with an\nappropriate choice of parameters, the algorithm for Sinkhorn Distances introduced in [Cut13] is\nin fact a near-linear time approximation algorithm for computing OT distances between discrete\nmeasures. 
This is the first proof that such near-linear time results are achievable for optimal transport. We also provide previously unavailable guidance for parameter tuning in this algorithm. Core to our work is a new and arguably more natural analysis of the Sinkhorn iteration algorithm, which we show converges in a number of iterations independent of the dimension n of the matrix to balance. In particular, this analysis directly suggests a greedy variant of Sinkhorn iteration that also provably runs in near-linear time and significantly outperforms the classical algorithm in practice. Finally, while most approximation algorithms output an approximation of the optimum value of the linear program (1), we also describe a simple, parallelizable rounding algorithm that provably outputs a feasible solution to (1). Specifically, for any ε > 0 and bounded, non-negative cost matrix C, we describe an algorithm that runs in time Õ(n²/ε³) and outputs P̂ ∈ U_{r,c} such that

    ⟨P̂, C⟩ ≤ min_{P ∈ U_{r,c}} ⟨P, C⟩ + ε .

We emphasize that our analysis does not require the cost matrix C to come from an underlying metric; we only require C to be non-negative. This implies that our results also give, for example, near-linear time approximation algorithms for Wasserstein p-distances between discrete measures.

Notation. We denote non-negative real numbers by ℝ_+, the set of integers {1, . . . , n} by [n], and the n-dimensional simplex by Δ_n := { x ∈ ℝⁿ_+ : Σ_{i=1}^n x_i = 1 }. For two probability distributions p, q ∈ Δ_n such that p is absolutely continuous w.r.t. 
q, we define the entropy H(p) of p and the Kullback-Leibler divergence K(p‖q) between p and q respectively by

    H(p) = Σ_{i=1}^n p_i log(1/p_i) ,    K(p‖q) := Σ_{i=1}^n p_i log(p_i/q_i) .

Similarly, for a matrix P ∈ ℝ^{n×n}_+, we define the entropy H(P) entrywise as Σ_{ij} P_{ij} log(1/P_{ij}). We use 1 and 0 to denote the all-ones and all-zeroes vectors in ℝⁿ. For a matrix A = (A_{ij}), we denote by exp(A) the matrix with entries (e^{A_{ij}}). For A ∈ ℝ^{n×n}, we denote its row and column sums by r(A) := A1 ∈ ℝⁿ and c(A) := Aᵀ1 ∈ ℝⁿ, respectively. The coordinates r_i(A) and c_j(A) denote the ith row sum and jth column sum of A, respectively. We write ‖A‖_∞ = max_{ij} |A_{ij}| and ‖A‖_1 = Σ_{ij} |A_{ij}|. For two matrices of the same dimension, we denote the Frobenius inner product of A and B by ⟨A, B⟩ = Σ_{ij} A_{ij} B_{ij}. For a vector x ∈ ℝⁿ, we write D(x) ∈ ℝ^{n×n} to denote the diagonal matrix with entries (D(x))_{ii} = x_i. For any two nonnegative sequences (u_n)_n, (v_n)_n, we write u_n = Õ(v_n) if there exist positive constants C, c such that u_n ≤ C v_n (log n)^c. For any two real numbers, we write a ∧ b = min(a, b).

2 Optimal Transport in near-linear time

In this section, we describe the main algorithm studied in this paper. Pseudocode appears in Algorithm 1.

Algorithm 1 APPROXOT(C, r, c, ε)
   η ← (4 log n)/ε , ε′ ← ε/(8‖C‖_∞)
   \\ Step 1: Approximately project onto U_{r,c}
1: A ← exp(−ηC)
2: B ← PROJ(A, U_{r,c}, ε′)
   \\ Step 2: Round to feasible point in U_{r,c}
3: Output P̂ ← ROUND(B, U_{r,c})

The core of our algorithm is the computation of an approximate Sinkhorn projection of the matrix A = exp(−ηC) (Step 1), details for which will be given in Section 3. 
Since our approximate Sinkhorn projection is not guaranteed to lie in the feasible set, we round our approximation to ensure that it lies in U_{r,c} (Step 2). Pseudocode for a simple, parallelizable rounding procedure is given in Algorithm 2.

Algorithm 1 hinges on two subroutines: PROJ and ROUND. We give two algorithms for PROJ: SINKHORN and GREENKHORN. We devote Section 3 to their analysis, which is of independent interest. On the other hand, ROUND is fairly simple. Its analysis is postponed to Section 4.

Algorithm 2 ROUND(F, U_{r,c})
1: X ← D(x) with x_i = (r_i / r_i(F)) ∧ 1
2: F′ ← XF
3: Y ← D(y) with y_j = (c_j / c_j(F′)) ∧ 1
4: F″ ← F′Y
5: err_r ← r − r(F″) , err_c ← c − c(F″)
6: Output G ← F″ + err_r err_cᵀ / ‖err_r‖_1

Our main theorem about Algorithm 1 is the following accuracy and runtime guarantee. The proof is postponed to Section 4, since it relies on the analysis of PROJ and ROUND.

Theorem 1. Algorithm 1 returns a point P̂ ∈ U_{r,c} satisfying

    ⟨P̂, C⟩ ≤ min_{P ∈ U_{r,c}} ⟨P, C⟩ + ε

in time O(n² + S), where S is the running time of the subroutine PROJ(A, U_{r,c}, ε′). In particular, if ‖C‖_∞ ≤ L, then S can be O(n²L³(log n)ε⁻³), so that Algorithm 1 runs in O(n²L³(log n)ε⁻³) time.

Remark 1. The time complexity in the above theorem reflects only elementary arithmetic operations. In the interest of clarity, we ignore questions of bit complexity that may arise from taking exponentials. The effect of this simplification is marginal since it can be easily shown [KLRS08] that the maximum bit complexity throughout the iterations of our algorithm is O(L(log n)/ε). 
As a result, factoring in bit complexity leads to a runtime of O(n²L⁴(log n)²ε⁻⁴), which is still truly near-linear.

3 Linear-time approximate Sinkhorn projection

The core of our OT algorithm is the entropic penalty proposed by Cuturi [Cut13]:

    P_η := argmin_{P ∈ U_{r,c}} { ⟨P, C⟩ − η⁻¹ H(P) } .    (2)

The solution to (2) can be characterized explicitly by analyzing its first-order conditions for optimality.

Lemma 1. [Cut13] For any cost matrix C and r, c ∈ Δ_n, the minimization program (2) has a unique minimum at P_η ∈ U_{r,c} of the form P_η = XAY, where A = exp(−ηC) and X, Y ∈ ℝ^{n×n}_+ are both diagonal matrices. The matrices (X, Y) are unique up to a constant factor.

We call the matrix P_η appearing in Lemma 1 the Sinkhorn projection of A, denoted Π_S(A, U_{r,c}), after Sinkhorn, who proved uniqueness in [Sin67]. Computing Π_S(A, U_{r,c}) exactly is impractical, so we implement instead an approximate version PROJ(A, U_{r,c}, ε′), which outputs a matrix B = XAY that may not lie in U_{r,c} but satisfies the condition ‖r(B) − r‖_1 + ‖c(B) − c‖_1 ≤ ε′. We stress that this condition is very natural from a statistical standpoint, since it requires that r(B) and c(B) are close to the target marginals r and c in total variation distance.

3.1 The classical Sinkhorn algorithm

Given a matrix A, Sinkhorn proposed a simple iterative algorithm to approximate the Sinkhorn projection Π_S(A, U_{r,c}), which is now known as the Sinkhorn-Knopp algorithm or RAS method. Despite the simplicity of this algorithm and its good performance in practice, it has been difficult to analyze. As a result, recent work showing that Π_S(A, U_{r,c}) can be approximated in near-linear time [AZLOW17, CMTV17] has bypassed the Sinkhorn-Knopp algorithm entirely¹. 
In our work, we obtain a new analysis of the simple and practical Sinkhorn-Knopp algorithm, showing that it also approximates Π_S(A, U_{r,c}) in near-linear time.

Pseudocode for the Sinkhorn-Knopp algorithm appears in Algorithm 3. In brief, it is an alternating projection procedure which renormalizes the rows and columns of A in turn so that they match the desired row and column marginals r and c. At each step, it prescribes to either modify all the rows by multiplying row i by r_i/r_i(A) for i ∈ [n], or to do the analogous operation on the columns. (We interpret the quantity 0/0 as 1 in this algorithm if ever it occurs.) The algorithm terminates when the matrix A^(k) is sufficiently close to the polytope U_{r,c}.

Algorithm 3 SINKHORN(A, U_{r,c}, ε′)
1: Initialize k ← 0
2: A^(0) ← A/‖A‖_1 , x⁰ ← 0 , y⁰ ← 0
3: while dist(A^(k), U_{r,c}) > ε′ do
4:   k ← k + 1
5:   if k odd then
6:     x_i ← log( r_i / r_i(A^(k−1)) ) for i ∈ [n]
7:     x^k ← x^(k−1) + x , y^k ← y^(k−1)
8:   else
9:     y_j ← log( c_j / c_j(A^(k−1)) ) for j ∈ [n]
10:    y^k ← y^(k−1) + y , x^k ← x^(k−1)
11:   A^(k) = D(exp(x^k)) A D(exp(y^k))
12: Output B ← A^(k)

3.2 Prior work

Before this work, the best analysis of Algorithm 3 showed that Õ((ε′)⁻²) iterations suffice to obtain a matrix close to U_{r,c} in ℓ₂ distance:

Proposition 1. [KLRS08] Let A be a strictly positive matrix. 
Algorithm 3 with dist(A, U_{r,c}) = ‖r(A) − r‖_2 + ‖c(A) − c‖_2 outputs a matrix B satisfying ‖r(B) − r‖_2 + ‖c(B) − c‖_2 ≤ ε′ in O(ρ(ε′)⁻² log(s/ℓ)) iterations, where s = Σ_{ij} A_{ij}, ℓ = min_{ij} A_{ij}, and ρ > 0 is such that r_i, c_i ≤ ρ for all i ∈ [n].

Unfortunately, this analysis is not strong enough to obtain a true near-linear time guarantee. Indeed, the ℓ₂ norm is not an appropriate measure of closeness between probability vectors, since very different distributions on large alphabets can nevertheless have small ℓ₂ distance: for example, (n⁻¹, . . . , n⁻¹, 0, . . . , 0) and (0, . . . , 0, n⁻¹, . . . , n⁻¹) in Δ_{2n} have ℓ₂ distance √(2/n) even though they have disjoint support. As noted above, for statistical problems, including computation of the OT distance, it is more natural to measure distance in ℓ₁ norm.

The following corollary gives the best ℓ₁ guarantee available from Proposition 1.

Corollary 1. Algorithm 3 with dist(A, U_{r,c}) = ‖r(A) − r‖_2 + ‖c(A) − c‖_2 outputs a matrix B satisfying ‖r(B) − r‖_1 + ‖c(B) − c‖_1 ≤ ε′ in O(nρ(ε′)⁻² log(s/ℓ)) iterations.

The extra factor of n in the runtime of Corollary 1 is the price to pay to convert an ℓ₂ bound to an ℓ₁ bound.

¹ Replacing the PROJ step in Algorithm 1 with the matrix-scaling algorithm developed in [CMTV17] results in a runtime that is a single factor of ε faster than what we present in Theorem 1. The benefit of our approach is that it is extremely easy to implement, whereas the matrix-scaling algorithm of [CMTV17] relies heavily on near-linear time Laplacian solver subroutines, which are not implementable in practice. 
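To make Algorithm 3 concrete, the following is a minimal sketch in plain Python (our own illustrative code, not the authors' implementation) of the alternating row/column rescaling with the ℓ₁ stopping rule analyzed below; all names are ours:

```python
def sinkhorn(A, r, c, eps):
    """Sketch of the Sinkhorn-Knopp iteration (Algorithm 3): alternately
    rescale rows then columns of A until the l1 distance of its marginals
    to (r, c) is at most eps. A is a list of lists with positive entries."""
    n, m = len(A), len(A[0])
    total = sum(sum(row) for row in A)
    B = [[v / total for v in row] for row in A]  # normalize so ||A||_1 = 1

    def row_sums(M):
        return [sum(row) for row in M]

    def col_sums(M):
        return [sum(M[i][j] for i in range(n)) for j in range(m)]

    def err(M):  # dist(M, U_rc) = ||r(M) - r||_1 + ||c(M) - c||_1
        return (sum(abs(a - b) for a, b in zip(row_sums(M), r))
                + sum(abs(a - b) for a, b in zip(col_sums(M), c)))

    k = 0
    while err(B) > eps:
        k += 1
        if k % 2 == 1:  # odd step: multiply row i by r_i / r_i(B)
            rs = row_sums(B)
            B = [[B[i][j] * r[i] / rs[i] for j in range(m)] for i in range(n)]
        else:           # even step: multiply column j by c_j / c_j(B)
            cs = col_sums(B)
            B = [[B[i][j] * c[j] / cs[j] for j in range(m)] for i in range(n)]
    return B
```

Each pass touches all n² entries; the point of the analysis in Section 3.3 is that the number of such passes needed is independent of n.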
Note that ρ ≥ 1/n, so nρ is always larger than 1. If r = c = 1/n are uniform distributions, then nρ = 1 and no dependence on the dimension appears. However, in the extreme where r or c contains an entry of constant size, we get nρ = Ω(n).

3.3 New analysis of the Sinkhorn algorithm

Our new analysis allows us to obtain a dimension-independent bound on the number of iterations beyond the uniform case.

Theorem 2. Algorithm 3 with dist(A, U_{r,c}) = ‖r(A) − r‖_1 + ‖c(A) − c‖_1 outputs a matrix B satisfying ‖r(B) − r‖_1 + ‖c(B) − c‖_1 ≤ ε′ in O((ε′)⁻² log(s/ℓ)) iterations, where s = Σ_{ij} A_{ij} and ℓ = min_{ij} A_{ij}.

Comparing our result with Corollary 1, we see that our bound is always stronger, by up to a factor of n. Moreover, our analysis is extremely short. Our improved results and simplified proof follow directly from the fact that we carry out the analysis entirely with respect to the Kullback-Leibler divergence, a common measure of statistical distance. This measure possesses a close connection to the total-variation distance via Pinsker's inequality (Lemma 4, below), from which we obtain the desired ℓ₁ bound. Similar ideas can be traced back at least to [GY98], where an analysis of Sinkhorn iterations for bistochastic targets is sketched in the context of a different problem: detecting the existence of a perfect matching in a bipartite graph.

We first define some notation. Given a matrix A and desired row and column sums r and c, we define the potential (Lyapunov) function f : ℝⁿ × ℝⁿ → ℝ by

    f(x, y) = Σ_{ij} A_{ij} e^{x_i + y_j} − ⟨r, x⟩ − ⟨c, y⟩ .

This auxiliary function has appeared in much of the literature on Sinkhorn projections [KLRS08, CMTV17, KK96, KK93]. 
We call the vectors x and y scaling vectors. It is easy to check that a minimizer (x*, y*) of f yields the Sinkhorn projection of A: writing X = D(exp(x*)) and Y = D(exp(y*)), first-order optimality conditions imply that XAY lies in U_{r,c}, and therefore XAY = Π_S(A, U_{r,c}).

The following lemma exactly characterizes the improvement in the potential function f from an iteration of Sinkhorn, in terms of our current divergence to the target marginals.

Lemma 2. If k ≥ 2, then f(x^(k−1), y^(k−1)) − f(x^k, y^k) = K(r‖r(A^(k−1))) + K(c‖c(A^(k−1))).

Proof. Assume without loss of generality that k is odd, so that c(A^(k−1)) = c and r(A^(k)) = r. (If k is even, interchange the roles of r and c.) By definition,

    f(x^(k−1), y^(k−1)) − f(x^k, y^k) = Σ_{ij} ( A^(k−1)_{ij} − A^(k)_{ij} ) + ⟨r, x^k − x^(k−1)⟩ + ⟨c, y^k − y^(k−1)⟩
                                      = Σ_i r_i ( x^k_i − x^(k−1)_i ) = K(r‖r(A^(k−1))) + K(c‖c(A^(k−1))) ,

where we have used that: ‖A^(k−1)‖_1 = ‖A^(k)‖_1 = 1 and Y^(k) = Y^(k−1); for all i, r_i (x^k_i − x^(k−1)_i) = r_i log( r_i / r_i(A^(k−1)) ); and K(c‖c(A^(k−1))) = 0 since c = c(A^(k−1)).

The next lemma has already appeared in the literature, and we defer its proof to the supplement.

Lemma 3. If A is a positive matrix with ‖A‖_1 ≤ s and smallest entry ℓ, then

    f(x¹, y¹) − min_{x,y ∈ ℝⁿ} f(x, y) ≤ f(0, 0) − min_{x,y ∈ ℝⁿ} f(x, y) ≤ log(s/ℓ) .

Lemma 4 (Pinsker's Inequality). For any probability measures p and q, ‖p − q‖_1 ≤ √(2 K(p‖q)).

Proof of Theorem 2. 
Let k* be the first iteration such that ‖r(A^(k*)) − r‖_1 + ‖c(A^(k*)) − c‖_1 ≤ ε′. Pinsker's inequality implies that for any k < k*, we have

    ε′² < ( ‖r(A^(k)) − r‖_1 + ‖c(A^(k)) − c‖_1 )² ≤ 4 ( K(r‖r(A^(k))) + K(c‖c(A^(k))) ) ,

so Lemmas 2 and 3 imply that we terminate in k* ≤ 4(ε′)⁻² log(s/ℓ) steps, as claimed.

3.4 Greedy Sinkhorn

In addition to a new analysis of SINKHORN, we propose a new algorithm GREENKHORN which enjoys the same convergence guarantee but performs better in practice. Instead of performing alternating updates of all rows and columns of A, the GREENKHORN algorithm updates only a single row or column at each step. Thus GREENKHORN updates only O(n) entries of A per iteration, rather than O(n²).

In this respect, GREENKHORN is similar to the stochastic algorithm for Sinkhorn projection proposed by [GCPB16]. There is a natural interpretation of both algorithms as coordinate descent algorithms in the dual space corresponding to row/column violations. Nevertheless, our algorithm differs from theirs in several key ways. Instead of choosing a row or column to update randomly, GREENKHORN chooses the best row or column to update greedily. Additionally, GREENKHORN does an exact line search on the coordinate in question since there is a simple closed form for the optimum, whereas the algorithm proposed by [GCPB16] updates in the direction of the average gradient. Our experiments establish that GREENKHORN performs better in practice; more details appear in the Supplement.

We emphasize that although this algorithm is an extremely natural modification of SINKHORN, previous analyses of SINKHORN cannot be modified to extract any meaningful performance guarantees on GREENKHORN. 
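The greedy scheme just described can be sketched in a few lines of plain Python (our own illustrative code, not the authors' implementation). It greedily picks the row or column whose marginal is furthest from its target, measured by the divergence ρ(a, b) = b − a + a log(a/b) used in the analysis, and rescales exactly that one row or column; all names are ours:

```python
import math

def rho(a, b):
    # Divergence rho(a, b) = b - a + a*log(a/b); nonnegative, zero iff a == b.
    return b - a + (a * math.log(a / b) if a > 0 else 0.0)

def greenkhorn(A, r, c, eps):
    """Sketch of the greedy variant: update only the single row or column
    whose marginal is furthest (in rho) from its target."""
    n, m = len(A), len(A[0])
    total = sum(sum(row) for row in A)
    A0 = [[v / total for v in row] for row in A]
    x, y = [0.0] * n, [0.0] * m  # log scaling vectors

    def scaled():
        return [[A0[i][j] * math.exp(x[i] + y[j]) for j in range(m)]
                for i in range(n)]

    while True:
        B = scaled()
        rs = [sum(row) for row in B]
        cs = [sum(B[i][j] for i in range(n)) for j in range(m)]
        if (sum(abs(a - b) for a, b in zip(rs, r))
                + sum(abs(a - b) for a, b in zip(cs, c))) <= eps:
            return B
        I = max(range(n), key=lambda i: rho(r[i], rs[i]))
        J = max(range(m), key=lambda j: rho(c[j], cs[j]))
        if rho(r[I], rs[I]) > rho(c[J], cs[J]):
            x[I] += math.log(r[I] / rs[I])  # exact line search on row I
        else:
            y[J] += math.log(c[J] / cs[J])  # exact line search on column J
```

For clarity this sketch recomputes all marginals on every pass, which costs O(n²); as the paper notes, maintaining the marginals incrementally brings each iteration down to O(n).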
On the other hand, our new analysis of SINKHORN from Section 3.3 applies to GREENKHORN with only trivial modifications.

Pseudocode for GREENKHORN appears in Algorithm 4. We let dist(A, U_{r,c}) = ‖r(A) − r‖_1 + ‖c(A) − c‖_1 and define the distance function ρ : ℝ_+ × ℝ_+ → [0, +∞] by

    ρ(a, b) = b − a + a log(a/b) .

The choice of ρ is justified by its appearance in Lemma 5, below. While ρ is not a metric, it is easy to see that ρ is nonnegative and satisfies ρ(a, b) = 0 iff a = b.

Algorithm 4 GREENKHORN(A, U_{r,c}, ε′)
1: A^(0) ← A/‖A‖_1 , x ← 0 , y ← 0
2: A ← A^(0)
3: while dist(A, U_{r,c}) > ε′ do
4:   I ← argmax_i ρ(r_i, r_i(A))
5:   J ← argmax_j ρ(c_j, c_j(A))
6:   if ρ(r_I, r_I(A)) > ρ(c_J, c_J(A)) then
7:     x_I ← x_I + log( r_I / r_I(A) )
8:   else
9:     y_J ← y_J + log( c_J / c_J(A) )
10:  A ← D(exp(x)) A^(0) D(exp(y))
11: Output B ← A

We note that after r(A) and c(A) are computed once at the beginning of the algorithm, GREENKHORN can easily be implemented such that each iteration runs in only O(n) time.

Theorem 3. The algorithm GREENKHORN outputs a matrix B satisfying ‖r(B) − r‖_1 + ‖c(B) − c‖_1 ≤ ε′ in O(n(ε′)⁻² log(s/ℓ)) iterations, where s = Σ_{ij} A_{ij} and ℓ = min_{ij} A_{ij}. Since each iteration takes O(n) time, such a matrix can be found in O(n²(ε′)⁻² log(s/ℓ)) time.

The analysis requires the following lemma, which is an easy modification of Lemma 2.

Lemma 5. Let A′ and A″ be successive iterates of GREENKHORN, with corresponding scaling vectors (x′, y′) and (x″, y″). 
If A″ was obtained from A′ by updating row I, then

    f(x′, y′) − f(x″, y″) = ρ(r_I, r_I(A′)) ,

and if it was obtained by updating column J, then

    f(x′, y′) − f(x″, y″) = ρ(c_J, c_J(A′)) .

We also require the following extension of Pinsker's inequality (proof in Supplement).

Lemma 6. For any α ∈ Δ_n, β ∈ ℝⁿ_+, define ρ(α, β) = Σ_i ρ(α_i, β_i). If ρ(α, β) ≤ 1, then

    ‖α − β‖_1 ≤ √(7 ρ(α, β)) .

Proof of Theorem 3. We follow the proof of Theorem 2. Since the row or column update is chosen greedily, at each step we make progress of at least (1/2n)(ρ(r, r(A)) + ρ(c, c(A))). If ρ(r, r(A)) and ρ(c, c(A)) are both at most 1, then under the assumption that ‖r(A) − r‖_1 + ‖c(A) − c‖_1 > ε′, our progress is at least

    (1/2n)( ρ(r, r(A)) + ρ(c, c(A)) ) ≥ (1/14n)( ‖r(A) − r‖_1² + ‖c(A) − c‖_1² ) ≥ ε′²/28n .

Likewise, if either ρ(r, r(A)) or ρ(c, c(A)) is larger than 1, our progress is at least 1/2n ≥ ε′²/28n. Therefore, we terminate in at most 28n(ε′)⁻² log(s/ℓ) iterations.

4 Proof of Theorem 1

First, we present a simple guarantee about the rounding Algorithm 2. The following lemma shows that the ℓ₁ distance between the input matrix F and rounded matrix G = ROUND(F, U_{r,c}) is controlled by the total-variation distance between the input matrix's marginals r(F) and c(F) and the desired marginals r and c.

Lemma 7. 
If r, c ∈ Δ_n and F ∈ ℝ^{n×n}_+, then Algorithm 2 takes O(n²) time to output a matrix G ∈ U_{r,c} satisfying

    ‖G − F‖_1 ≤ 2 [ ‖r(F) − r‖_1 + ‖c(F) − c‖_1 ] .

The proof of Lemma 7 is simple and left to the Supplement. (We also describe in the Supplement a randomized variant of Algorithm 2 that achieves a slightly better bound than Lemma 7.) We are now ready to prove Theorem 1.

Proof of Theorem 1. ERROR ANALYSIS. Let B be the output of PROJ(A, U_{r,c}, ε′), and let P* ∈ argmin_{P ∈ U_{r,c}} ⟨P, C⟩ be an optimal solution to the original OT program.

We first show that ⟨B, C⟩ is not much larger than ⟨P*, C⟩. To that end, write r′ := r(B) and c′ := c(B). Since B = XAY for positive diagonal matrices X and Y, Lemma 1 implies B is the optimal solution to

    min_{P ∈ U_{r′,c′}} ⟨P, C⟩ − η⁻¹ H(P) .    (3)

By Lemma 7, there exists a matrix P′ ∈ U_{r′,c′} such that ‖P′ − P*‖_1 ≤ 2(‖r′ − r‖_1 + ‖c′ − c‖_1). Moreover, since B is an optimal solution of (3), we have

    ⟨B, C⟩ − η⁻¹ H(B) ≤ ⟨P′, C⟩ − η⁻¹ H(P′) .

Thus, by Hölder's inequality,

    ⟨B, C⟩ − ⟨P*, C⟩ = ⟨B, C⟩ − ⟨P′, C⟩ + ⟨P′, C⟩ − ⟨P*, C⟩
                     ≤ η⁻¹( H(B) − H(P′) ) + 2(‖r′ − r‖_1 + ‖c′ − c‖_1)‖C‖_∞
                     ≤ 2η⁻¹ log n + 2(‖r′ − r‖_1 + ‖c′ − 
c‖_1)‖C‖_∞ ,    (4)

where we have used the fact that 0 ≤ H(B), H(P′) ≤ 2 log n.

Lemma 7 implies that the output P̂ of ROUND(B, U_{r,c}) satisfies the inequality ‖B − P̂‖_1 ≤ 2(‖r′ − r‖_1 + ‖c′ − c‖_1). This fact together with (4) and Hölder's inequality yields

    ⟨P̂, C⟩ ≤ min_{P ∈ U_{r,c}} ⟨P, C⟩ + 2η⁻¹ log n + 4(‖r′ − r‖_1 + ‖c′ − c‖_1)‖C‖_∞ .

Applying the guarantee of PROJ(A, U_{r,c}, ε′), we obtain

    ⟨P̂, C⟩ ≤ min_{P ∈ U_{r,c}} ⟨P, C⟩ + (2 log n)/η + 4ε′‖C‖_∞ .

Plugging in the values of η and ε′ prescribed in Algorithm 1 finishes the error analysis.

RUNTIME ANALYSIS. Lemma 7 shows that Step 2 of Algorithm 1 takes O(n²) time. The runtime of Step 1 is dominated by the PROJ(A, U_{r,c}, ε′) subroutine. Theorems 2 and 3 imply that both the SINKHORN and GREENKHORN algorithms accomplish this in S = O(n²(ε′)⁻² log(s/ℓ)) time, where s is the sum of the entries of A and ℓ is the smallest entry of A. Since the matrix C is nonnegative, the entries of A are bounded above by 1, thus s ≤ n². The smallest entry of A is e^{−η‖C‖_∞}, so log(1/ℓ) = η‖C‖_∞. We obtain S = O(n²(ε′)⁻²(log n + η‖C‖_∞)). The proof is finished by plugging in the values of η and ε′ prescribed in Algorithm 1.

5 Empirical results

Cuturi [Cut13] already gave experimental evidence that using SINKHORN to solve (2) outperforms state-of-the-art techniques for optimal transport. 
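Before turning to the experiments, we note that the rounding step ROUND (Algorithm 2) analyzed above admits an equally short implementation. The sketch below (our own illustrative code, in plain Python; it assumes F has positive row and column sums) scales each row and column down by at most 1 and then restores the missing mass with a rank-one correction:

```python
def round_to_feasible(F, r, c):
    """Sketch of ROUND (Algorithm 2): after the two capped scalings, the
    row sums of F2 are <= r and the column sums are <= c, so the residuals
    err_r, err_c are nonnegative with equal total mass; adding the rank-one
    matrix err_r err_c^T / ||err_r||_1 makes the marginals exactly (r, c)."""
    n, m = len(r), len(c)
    rs = [sum(row) for row in F]
    F1 = [[F[i][j] * min(r[i] / rs[i], 1.0) for j in range(m)]
          for i in range(n)]
    cs = [sum(F1[i][j] for i in range(n)) for j in range(m)]
    F2 = [[F1[i][j] * min(c[j] / cs[j], 1.0) for j in range(m)]
          for i in range(n)]
    err_r = [r[i] - sum(F2[i]) for i in range(n)]                       # >= 0
    err_c = [c[j] - sum(F2[i][j] for i in range(n)) for j in range(m)]  # >= 0
    t = sum(err_r)  # = ||err_r||_1 = ||err_c||_1
    if t == 0.0:
        return F2  # already feasible
    return [[F2[i][j] + err_r[i] * err_c[j] / t for j in range(m)]
            for i in range(n)]
```

Feeding the output of an approximate Sinkhorn projection through this routine yields the feasible matrix P̂ of Theorem 1; every operation is entrywise or rank-one, which is what makes the procedure parallelizable.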
In this section, we provide strong empirical evidence that our proposed GREENKHORN algorithm significantly outperforms SINKHORN.

We consider transportation between pairs of m × m greyscale images, normalized to have unit total mass. The target marginals r and c represent the two images in a pair, and C ∈ ℝ^{m²×m²} is the matrix of ℓ₁ distances between pixel locations. We therefore aim to compute the earth mover's distance.

We run experiments on two datasets: real images from MNIST, and synthetic images, as in Figure 1.

Figure 1: Synthetic image.

5.1 MNIST

We first compare the behavior of GREENKHORN and SINKHORN on real images. To that end, we choose 10 random pairs of images from the MNIST dataset, and for each pair analyze the performance of APPROXOT when using both GREENKHORN and SINKHORN for the approximate projection step. We add negligible noise of 0.01 to each background pixel with intensity 0. Figure 2 paints a clear picture: GREENKHORN significantly outperforms SINKHORN both in the short and long term.

5.2 Random images

To better understand the empirical behavior of both algorithms in a number of different regimes, we devised a synthetic and tunable framework whereby we generate images by choosing a randomly positioned "foreground" square in an otherwise black background. The size of this square is a tunable parameter varied between 20%, 50%, and 80% of the total image area. Intensities of background pixels are drawn uniformly from [0, 1]; foreground pixels are drawn uniformly from [0, 50]. Such an image is depicted in Figure 1, and results appear in Figure 2.

We perform two other experiments with random images in Figure 3. In the first, we vary the number of background pixels and show that GREENKHORN performs better when the number of background pixels is larger.
We conjecture that this is related to the fact that GREENKHORN only updates salient rows and columns at each step, whereas SINKHORN wastes time updating rows and columns corresponding to background pixels, which have negligible impact. This demonstrates that GREENKHORN is a better choice especially when data is sparse, which is often the case in practice.

Figure 2: Comparison of GREENKHORN and SINKHORN on pairs of MNIST images of dimension 28 × 28 (top) and random images of dimension 20 × 20 with 20% foreground (bottom). Left: distance dist(A, U_{r,c}) to the transport polytope (average over 10 random pairs of images). Right: maximum, median, and minimum values of the competitive ratio ln(dist(A_S, U_{r,c}) / dist(A_G, U_{r,c})) over 10 runs.

In the second, we consider the role of the regularization parameter η. Our analysis requires taking η of order log n / ε, but Cuturi [Cut13] observed that in practice η can be much smaller. Cuturi showed that SINKHORN outperforms state-of-the-art techniques for computing OT distances even when η is a small constant, and Figure 3 shows that GREENKHORN runs faster than SINKHORN in this regime with no loss in accuracy.

Figure 3: Left: comparison of median competitive ratio for random images containing 20%, 50%, and 80% foreground. Right: performance of GREENKHORN and SINKHORN for small values of η.

Acknowledgments

We thank Michael Cohen, Adrian Vladu, John Kelner, Justin Solomon, and Marco Cuturi for helpful discussions. We are grateful to Pablo Parrilo for drawing our attention to the fact that GREENKHORN is a coordinate descent algorithm, and to Alexandr Andoni for references.

JA and JW were generously supported by NSF Graduate Research Fellowship 1122374. PR is supported in part by grants NSF CAREER DMS-1541099, NSF DMS-1541100, NSF DMS-1712596, DARPA W911NF-16-1-0551, ONR N00014-17-1-2147, and a grant from the MIT NEC Corporation.

References
[ACB17] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.

[ANOY14] A. Andoni, A. Nikolov, K. Onak, and G. Yaroslavtsev. Parallel algorithms for geometric graph problems. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 574–583. ACM, 2014.

[AS14] P. K. Agarwal and R. Sharathkumar. Approximation algorithms for bipartite matching with metric and geometric costs. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 555–564. ACM, 2014.

[AZLOW17] Z. Allen-Zhu, Y. Li, R. Oliveira, and A. Wigderson. Much faster algorithms for matrix scaling. arXiv:1704.02315, 2017.

[BCC+15] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

[BGKL17] J. Bigot, R. Gouet, T. Klein, and A. López. Geodesic PCA in the Wasserstein space by convex PCA. Ann. Inst. H. Poincaré Probab. Statist., 53(1):1–26, 2017.

[Bub15] S. Bubeck. Convex optimization: algorithms and complexity. Found. Trends Mach. Learn., 8(3–4):231–357, 2015.

[BvdPPH11] N. Bonneel, M. van de Panne, S. Paris, and W. Heidrich. Displacement interpolation using Lagrangian mass transport. ACM Trans. Graph., 30(6):158:1–158:12, 2011.

[CBL06] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[CMTV17] M. B. Cohen, A. Madry, D. Tsipras, and A. Vladu. Matrix scaling and balancing via box constrained Newton's method and interior point methods. arXiv:1704.02310, 2017.

[Cut13] M. Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.

[FCCR16] R. Flamary, M. Cuturi, N. Courty, and A. Rakotomamonjy. Wasserstein discriminant analysis. arXiv:1608.08063, 2016.

[GCPB16] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.

[GD04] K. Grauman and T. Darrell. Fast contour matching using approximate earth mover's distance. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages I-220–I-227, 2004.

[GPC15] A. Gramfort, G. Peyré, and M. Cuturi. Fast optimal transport averaging of neuroimaging data, pages 261–272. Springer International Publishing, 2015.

[GY98] L. Gurvits and P. Yianilos. The deflation-inflation method for certain semidefinite programming and maximum determinant completion problems. Technical report, NECI, 1998.

[IT03] P. Indyk and N. Thaper. Fast image retrieval via embeddings. In Third International Workshop on Statistical and Computational Theories of Vision, 2003.

[JRT08] A. Juditsky, P. Rigollet, and A. Tsybakov. Learning by mirror averaging. Ann. Statist., 36(5):2183–2206, 2008.

[JSCG16] W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems 29, pages 181–189, 2016.

[KK93] B. Kalantari and L. Khachiyan. On the rate of convergence of deterministic and randomized RAS matrix scaling algorithms. Oper. Res. Lett., 14(5):237–244, 1993.

[KK96] B. Kalantari and L. Khachiyan. On the complexity of nonnegative-matrix scaling. Linear Algebra Appl., 240:87–103, 1996.

[KLRS08] B. Kalantari, I. Lari, F. Ricca, and B. Simeone. On the complexity of general matrix scaling and entropy minimization via the RAS algorithm. Math. Program., 112(2, Ser. A):371–401, 2008.

[Leo14] C. Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete and Continuous Dynamical Systems, 34(4):1533–1574, 2014.

[LG15] J. R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS '15, pages 829–837. MIT Press, 2015.

[LS14] Y. T. Lee and A. Sidford. Path finding methods for linear programming: solving linear programs in Õ(√rank) iterations and faster algorithms for maximum flow. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, FOCS '14, pages 424–433. IEEE Computer Society, 2014.

[MJ15] J. Mueller and T. Jaakkola. Principal differences analysis: interpretable characterization of differences between distributions. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS '15, pages 1702–1710. MIT Press, 2015.

[PW09] O. Pele and M. Werman. Fast and robust earth mover's distances. In 2009 IEEE 12th International Conference on Computer Vision, pages 460–467, 2009.

[PZ16] V. M. Panaretos and Y. Zemel. Amplitude and phase variation of point processes. Ann. Statist., 44(2):771–812, 2016.

[Ren88] J. Renegar. A polynomial-time algorithm, based on Newton's method, for linear programming. Mathematical Programming, 40(1):59–93, 1988.

[RT11] P. Rigollet and A. Tsybakov. Exponential screening and optimal rates of sparse estimation. Ann. Statist., 39(2):731–771, 2011.

[RT12] P. Rigollet and A. Tsybakov. Sparse estimation by exponential weighting. Statistical Science, 27(4):558–575, 2012.

[RTG00] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. Int. J. Comput. Vision, 40(2):99–121, 2000.

[SA12] R. Sharathkumar and P. K. Agarwal. A near-linear time ε-approximation algorithm for geometric bipartite matching. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, pages 385–394. ACM, 2012.

[Sch31] E. Schrödinger. Über die Umkehrung der Naturgesetze. Angewandte Chemie, 44(30):636–636, 1931.

[SdGP+15] J. Solomon, F. de Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, T. Du, and L. Guibas. Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Trans. Graph., 34(4):66:1–66:11, 2015.

[Sin67] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.

[SJ08] S. Shirdhonkar and D. W. Jacobs. Approximate earth mover's distance in linear time. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

[SL11] R. Sandler and M. Lindenbaum. Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1590–1602, 2011.

[SR04] G. J. Székely and M. L. Rizzo. Testing for equal distributions in high dimension. InterStat (London), 11(5):1–16, 2004.

[ST04] D. A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC '04, pages 81–90. ACM, 2004.

[Vil09] C. Villani. Optimal Transport, volume 338 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 2009.

[WPR85] M. Werman, S. Peleg, and A. Rosenfeld. A distance metric for multidimensional histograms. Computer Vision, Graphics, and Image Processing, 32(3):328–336, 1985.