{"title": "A Direct tilde{O}(1/epsilon) Iteration Parallel Algorithm for Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 11359, "page_last": 11370, "abstract": "Optimal transportation, or computing the Wasserstein or ``earth mover's'' distance between two $n$-dimensional distributions, is a fundamental primitive which arises in many learning and statistical settings. We give an algorithm which solves the problem to additive $\\epsilon$ accuracy with $\\tilde{O}(1/\\epsilon)$ parallel depth and $\\tilde{O}\\left(n^2/\\epsilon\\right)$ work. \n\t\t\\cite{BlanchetJKS18, Quanrud19} obtained this runtime through reductions to positive linear programming and matrix scaling. However, these reduction-based algorithms use subroutines which may be impractical due to requiring solvers for second-order iterations (matrix scaling) or non-parallelizability (positive LP). Our methods match the previous-best work bounds by \\cite{BlanchetJKS18, Quanrud19} while either improving parallelization or removing the need for linear system solves, and improve upon the previous best first-order methods running in time $\\tilde{O}(\\min(n^2 / \\epsilon^2, n^{2.5}  / \\epsilon))$ \\cite{DvurechenskyGK18, LinHJ19}. We obtain our results by a primal-dual extragradient method, motivated by recent theoretical improvements to maximum flow \\cite{Sherman17}.", "full_text": "A Direct \u02dcO(1/\u0001) Iteration Parallel Algorithm for\n\nOptimal Transport\n\nArun Jambulapati, Aaron Sidford, and Kevin Tian\n{jmblpati,sidford,kjtian}@stanford.edu\n\nStanford University\n\nAbstract\n\nOptimal transportation, or computing the Wasserstein or \u201cearth mover\u2019s\u201d distance\nbetween two n-dimensional distributions, is a fundamental primitive which arises\nin many learning and statistical settings. 
We give an algorithm which solves the\nproblem to additive $\\epsilon$ accuracy with $\\tilde{O}(1/\\epsilon)$ parallel depth and $\\tilde{O}(n^2/\\epsilon)$ work.\n\n[BJKS18, Qua19] obtained this runtime through reductions to positive linear programming\nand matrix scaling. However, these reduction-based algorithms use\nsubroutines which may be impractical due to requiring solvers for second-order\niterations (matrix scaling) or non-parallelizability (positive LP). Our methods\nmatch the previous-best work bounds by [BJKS18, Qua19] while either improving\nparallelization or removing the need for linear system solves, and improve\nupon the previous best first-order methods running in time $\\tilde{O}(\\min(n^2/\\epsilon^2, n^{2.5}/\\epsilon))$\n[DGK18, LHJ19]. We obtain our results by a primal-dual extragradient method,\nmotivated by recent theoretical improvements to maximum flow [She17].\n\n1 Introduction\n\nOptimal transport is playing an increasingly important role as a subroutine in tasks arising\nin machine learning [ACB17], computer vision [BvdPPH11, SdGP+15], robust optimization\n[EK18, BK17], and statistics [PZ16]. Given these applications for large scale learning, designing\nalgorithms for efficiently approximately solving the problem has been the subject of extensive\nrecent research [Cut13, AWR17, GCPB16, CK18, DGK18, LHJ19, BJKS18, Qua19].\nGiven two vectors $r$ and $c$ in the $n$-dimensional probability simplex $\\Delta^n$ and a cost matrix $C \\in \\mathbb{R}^{n \\times n}_{\\geq 0}$\u00b9, the optimal transportation problem is\n\n$$\\min_{X \\in U_{r,c}} \\langle C, X \\rangle, \\quad \\text{where } U_{r,c} \\stackrel{\\mathrm{def}}{=} \\left\\{ X \\in \\mathbb{R}^{n \\times n}_{\\geq 0},\\ X\\mathbf{1} = r,\\ X^\\top \\mathbf{1} = c \\right\\}. \\qquad (1)$$\n\nThis problem arises from defining the Wasserstein or Earth mover\u2019s distance between discrete probability\nmeasures $r$ and $c$, as the cheapest coupling between the distributions, where the cost of the\ncoupling $X \\in U_{r,c}$ is $\\langle C, X \\rangle$. 
If $r$ and $c$ are viewed as distributions of masses placed on $n$ points in\nsome space (typically metric), the Wasserstein distance is the cheapest way to move mass to transform\n$r$ into $c$. In (1), $X$ represents the transport plan ($X_{ij}$ is the amount moved from $r_i$ to $c_j$) and\n$C$ represents the cost of movement ($C_{ij}$ is the cost of moving mass from $r_i$ to $c_j$).\nThroughout, the value of (1) is denoted OPT. We call $\\hat{X} \\in U_{r,c}$ an $\\epsilon$-approximate transportation\nplan if $\\langle C, \\hat{X} \\rangle \\leq \\mathrm{OPT} + \\epsilon$. Our goal is to design an efficient algorithm to produce such an $\\hat{X}$.\n\n\u00b9Similarly to earlier works, we focus on square matrices; generalizations to rectangular matrices are straightforward.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1.1 Our Contributions\nOur main contribution is an algorithm running in $\\tilde{O}(\\|C\\|_{\\max}/\\epsilon)$ parallelizable iterations\u00b2 and\n$\\tilde{O}(n^2 \\|C\\|_{\\max}/\\epsilon)$ total work producing an $\\epsilon$-approximate transport plan.\nMatching runtimes were given in the recent work of [BJKS18, Qua19]. Their runtimes were obtained\nvia reductions to matrix scaling and positive linear programming, each well-studied problems\nin theoretical computer science. However, the matrix scaling algorithm is a second-order Newton-type\nmethod which makes calls to structured linear system solvers, and the positive LP algorithm\nis not parallelizable (i.e. has depth polynomial in dimension). These features potentially limit the\npracticality of these algorithms. The key remaining open question this paper addresses is: is there an\nefficient first-order, parallelizable algorithm for approximating optimal transport? 
We answer this\naffirmatively and give an efficient, parallelizable primal-dual first-order method; the only additional\noverhead is a scheme for implementing steps, incurring roughly an additional $\\log \\epsilon^{-1}$ factor.\nOur approach heavily leverages the recent improvement to the maximum flow problem, and more\nbroadly two-player games on a simplex ($\\ell_1$ ball) and a box ($\\ell_\\infty$ ball), due to the breakthrough\nwork of [She17]. First, we recast (1) as a minimax game between a box and a simplex, proving\ncorrectness via a rounding procedure known in the optimal transport literature. Second, we show\nhow to adapt the dual extrapolation scheme under the weaker convergence requirements of area-convexity,\nfollowing [She17], to obtain an approximate minimizer to our primal-dual objective in\nthe stated runtime. En route, we slightly simplify analysis in [She17] and relate it more closely to\nthe existing extragradient literature.\nFinally, we give preliminary experimental evidence showing our algorithm can be practical, and\nhighlight some open directions in bridging the gap between theory and practice of our method, as\nwell as accelerated gradient schemes [DGK18, LHJ19] and Sinkhorn iteration.\n\n1.2 Previous Work\n\nOptimal Transport. The problem of giving efficient algorithms to find $\\epsilon$-approximate transport\nplans $\\hat{X}$ which run in nearly linear time\u00b3 has been addressed by a line of recent work, starting with\n[Cut13] and improved upon in [GCPB16, AWR17, DGK18, LHJ19, BJKS18, Qua19]. 
We briefly\ndiscuss their approaches here.\nWorks by [Cut13, AWR17] studied the Sinkhorn algorithm, an alternating minimization scheme.\nRegularizing (1) with an $\\eta^{-1}$ multiple of entropy and computing the dual, we arrive at the problem\n\n$$\\min_{x, y \\in \\mathbb{R}^n} \\mathbf{1}^\\top B_{\\eta C}(x, y) \\mathbf{1} - r^\\top x - c^\\top y, \\quad \\text{where } B_{\\eta C}(x, y)_{ij} = e^{x_i + y_j - \\eta C_{ij}}.$$\n\nThis problem is equivalent to computing diagonal scalings $X$ and $Y$ for $M = \\exp(-\\eta C)$ such that\n$XMY$ has row sums $r$ and column sums $c$. The Sinkhorn iteration alternates fixing the row sums\nand the column sums by left and right scaling by diagonal matrices until an approximation of such\nscalings is found, or equivalently until $XMY$ is close to being in $U_{r,c}$.\nAs shown in [AWR17], we can round the resulting almost-transportation plan to a transportation\nplan which lies in $U_{r,c}$ in linear time, losing at most $2\\|C\\|_{\\max}(\\|X\\mathbf{1} - r\\|_1 + \\|X^\\top \\mathbf{1} - c\\|_1)$ in\nthe objective. Further, [AWR17] showed that $\\tilde{O}(\\|C\\|_{\\max}^3/\\epsilon^3)$ iterations of this scheme sufficed\nto obtain a matrix which $\\epsilon/\\|C\\|_{\\max}$-approximately meets the demands in $\\ell_1$ with good objective\nvalue, by analyzing it as an instance of mirror descent with an entropic regularizer. The same\nwork proposed an alternative algorithm, Greenkhorn, based on greedy coordinate descent. [DGK18,\nLHJ19] showed that $\\tilde{O}(\\|C\\|_{\\max}^2/\\epsilon^2)$ iterations, corresponding to $\\tilde{O}(n^2 \\|C\\|_{\\max}^2/\\epsilon^2)$ work, suffice\nfor both Sinkhorn and Greenkhorn, the current state-of-the-art for this line of analysis.\nAn alternative approach based on first-order methods was studied by [DGK18, LHJ19]. 
These works\nconsidered minimizing an entropy-regularized Equation 1; the resulting weighted softmax function\nis prevalent in the literature on approximate linear programming [Nes05], and has found similar\napplications in near-linear algorithms for maximum flow [She13, KLOS14, ST18] and positive linear\nprogramming [You01, AO15]. An unaccelerated algorithm, viewable as $\\ell_\\infty$ gradient descent, was\nanalyzed in [DGK18] and ran in $\\tilde{O}(\\|C\\|_{\\max}/\\epsilon^2)$ iterations. Further, an accelerated algorithm was\ndiscussed, for which the authors claimed an $\\tilde{O}(n^{1/4} \\|C\\|_{\\max}^{0.5}/\\epsilon)$ iteration count. [LHJ19] showed\nthat the algorithm had an additional dependence on a parameter as bad as $n^{1/4}$, roughly due to a\ngap between the $\\ell_2$ and $\\ell_\\infty$ norms. Thus, the state of the art runtime in this line is the better of\n$\\tilde{O}(n^{2.5} \\|C\\|_{\\max}^{0.5}/\\epsilon)$, $\\tilde{O}(n^2 \\|C\\|_{\\max}/\\epsilon^2)$ operations. The dependence on dimension of the former\nof these runtimes matches that of the linear programming solver of [LS14, LS15], which obtain\na polylogarithmic dependence on $\\epsilon^{-1}$, rather than a polynomial dependence; thus, the question of\nobtaining an accelerated $\\epsilon^{-1}$ dependence without worse dimension dependence remained open.\n\n\u00b2Our iterations consist of vector operations and matrix-vector products, which are easily parallelizable. Throughout $\\|C\\|_{\\max}$ is the largest entry of $C$.\n\n\u00b3We use \u201cnearly linear\u201d to describe complexities which have an $n^2 \\mathrm{polylog}(n)$ dependence on the dimension (where the size of input $C$ is $n^2$), and polynomial dependence on $\\|C\\|_{\\max}$, $\\epsilon^{-1}$.\n\nThis was partially settled in [BJKS18, Qua19], which studied the relationship of optimal transport\nto fundamental algorithmic problems in theoretical computer science, namely positive linear\nprogramming and matrix scaling, for which significantly-improved runtimes have been recently obtained\n[AO15, ZLdOW17, CMTV17]. In particular, they showed that optimal transport could be\nreduced to instances of either of these objectives, for which $\\tilde{O}(\\|C\\|_{\\max}/\\epsilon)$ iterations, each of which\nrequired linear $O(n^2)$ work, sufficed. However, both of these reductions are based on black-box\nmethods for which practical implementations are not known; furthermore, in the case of positive\nlinear programming a parallel $\\tilde{O}(1/\\epsilon)$-iteration algorithm is not known. 
[BJKS18] also showed any\npolynomial improvement to the runtime of our paper in the dependence on either $\\epsilon$ or $n$ would result\nin maximum-cardinality bipartite matching in dense graphs faster than $\\tilde{O}(n^{2.5})$ without fast matrix\nmultiplication [San09], a fundamental open problem unresolved for almost 50 years [HK73].\n\nYear | Author | Complexity | Approach | 1st-order | Parallel\n2015 | [LS15] | $\\tilde{O}(n^{2.5})$ | Interior point | No | No\n2017-19 | [AWR17] | $\\tilde{O}(n^2 \\lVert C \\rVert_{\\max}^2/\\epsilon^2)$ | Sink/Greenkhorn | Yes | Yes\n2018 | [DGK18] | $\\tilde{O}(n^2 \\lVert C \\rVert_{\\max}^2/\\epsilon^2)$ | Gradient descent | Yes | Yes\n2018-19 | [LHJ19] | $\\tilde{O}(n^{2.5} \\lVert C \\rVert_{\\max}/\\epsilon)$ | Acceleration | Yes | Yes\n2018 | [BJKS18] | $\\tilde{O}(n^2 \\lVert C \\rVert_{\\max}/\\epsilon)$ | Matrix scaling | No | Yes\n2018-19 | [BJKS18, Qua19] | $\\tilde{O}(n^2 \\lVert C \\rVert_{\\max}/\\epsilon)$ | Positive LP | Yes | No\n2019 | This work | $\\tilde{O}(n^2 \\lVert C \\rVert_{\\max}/\\epsilon)$ | Dual extrapolation | Yes | Yes\n\nTable 1: Optimal transport algorithms. Algorithms using second-order information use potentially-expensive\nSDD system solvers; the runtime analysis of Sink/Greenkhorn is due to [DGK18, LHJ19].\n\nSpecializations of the transportation problem to $\\ell_p$ metric spaces or arising from geometric settings\nhave been studied [SA12, AS14, ANOY14]. These specialized approaches seem fundamentally\ndifferent than those concerning the more general transportation problem.\nFinally, we note recent work [ABRW18] showed the promise of using the Nystr\u00f6m method for low-rank\napproximations to achieve speedup in theory and practice for transport problems arising from\nspecific metrics. As our method is based on matrix-vector operations, where low-rank approximations\nmay be applicable, we find it interesting to see if our method can be combined with these\nimprovements.\nRemark. 
During the revision process for this work, an independent result [LMR19] was published\nto arXiv, obtaining improved runtimes for optimal transport via a combinatorial algorithm. The\nwork obtains a runtime of $\\tilde{O}(n^2 \\|C\\|_{\\max}/\\epsilon + n \\|C\\|_{\\max}^2/\\epsilon^2)$, which is worse than our runtime by a\nlow-order term. Furthermore, it does not appear to be parallelizable.\nBox-simplex objectives. Our main result follows from improved algorithms for bilinear minimax\nproblems over one simplex domain and one box domain developed in [She17]. This fundamental\nminimax problem captures $\\ell_1$ and $\\ell_\\infty$ regression over a simplex and box respectively, and inspired\nthe development of conjugate smoothing [Nes05] as well as mirror prox / dual extrapolation [Nem04,\nNes07]. These latter two approaches are extragradient methods (using two gradient operations per\niteration rather than one) for approximately solving a family of problems, which includes convex\nminimization and finding a saddle point to a convex-concave function. These methods simulate\nbackwards Euler discretization of the gradient flow, similar to how mirror descent simulates forwards\nEuler discretization [DO19]. The role of the extragradient step is a fixed point iteration (of two steps)\nwhich is a good approximation of the backwards Euler step when the operator is Lipschitz.\nNonetheless, the analysis of [Nem04, Nes07] fell short in obtaining a $1/T$ rate of convergence\nwithout worse dependence on dimension for these domains, where $T$ is the iteration count (which\nwould correspond to an $\\tilde{O}(1/\\epsilon)$ runtime for approximate minimization). The fundamental barrier was\nthat over a box, any strongly-convex regularizer in the $\\ell_\\infty$ norm has a dimension-dependent domain\nsize (shown in [ST18]). 
This barrier can also be viewed as the reason for the worse dimension\ndependence in the accelerated scheme of [DGK18, LHJ19].\nThe primary insight of [She17] was that previous approaches attempted to regularize the schemes of\n[Nem04, Nes07] with separable regularizers, i.e. the sum of a regularizer which depends only on the\nprimal block and one which depends only on the dual. If, say, the domain of the primal block was a\nbox, then such a regularization scheme would run into the $\\ell_\\infty$ barrier and incur a worse dependence\non dimension. However, by more carefully analyzing the requirements of these algorithms, [She17]\nconstructed a non-separable regularizer with small domain size, satisfying a property termed area-convexity\nwhich sufficed for provable convergence of dual extrapolation [Nes07]. Interestingly, the\nproperty seems specialized to dual extrapolation and not mirror prox [Nem04].\n\n2 Overview\n\nFirst, in Section 2.1 we describe a reformulation of (1) as a primal-dual objective, which we\nsolve approximately in Section 3. Then in Section 2.2 we give additional notation critical for our\nanalysis\u2074. In Section 3 we leverage this to give an overview of our main algorithm.\n\n2.1 $\\ell_1$-regression formulation\n\nWe adapt the view of [BJKS18, Qua19] of the objective (1) as a positive linear program. Let $d$ be the\n(vectorized) cost matrix $C$ associated with the instance and let $\\Delta^{n^2}$ be the $n^2$ dimensional simplex\u2075.\nWe recall $r$, $c$ are specified row and column sums with $\\mathbf{1}^\\top r = \\mathbf{1}^\\top c = 1$. 
The optimal transport\nproblem can be written as, for $m = n^2$, and $A \\in \\{0, 1\\}^{2n \\times m}$, $b \\in \\mathbb{R}^{2n}_{\\geq 0}$, for $A$ the (unsigned)\nedge-incidence matrix of the underlying bipartite graph and $b$ the concatenation of $r$ and $c$,\n\n$$\\min_{x \\in \\Delta^m, Ax = b} d^\\top x. \\qquad (2)$$\n\n$$A = \\begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\\\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\\\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\\\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\\\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\\\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\end{pmatrix}, \\quad b = \\begin{pmatrix} 1/3 \\\\ 1/3 \\\\ 1/3 \\\\ 1/3 \\\\ 1/3 \\\\ 1/3 \\end{pmatrix}.$$\n\nFigure 1: Edge-incidence matrix $A$ of a $3 \\times 3$ bipartite graph and uniform demands.\n\nIn particular, $A$ is the 0-1 matrix on $V \\times E$ such that $A_{ve} = 1$ iff $v$ is an endpoint of edge $e$. We\nsummarize some additional properties of the constraint matrix $A$ and vector $b$.\nFact 2.1. $A$, $b$ have the following properties.\n\n1. $A \\in \\{0, 1\\}^{2n \\times m}$ has 2-sparse columns and $n$-sparse rows. Thus $\\|A\\|_{1 \\to 1} = 2$.\n\n2. $b^\\top = (r^\\top \\ c^\\top)$, so that $\\|b\\|_1 = 2$.\n\n3. $A$ has $2n^2$ nonzero entries.\n\n\u2074Because many of the objects defined in Section 2.2 are developed in Section 2.1, we postpone their statement,\nbut refer the reader to Section 2.2 for any ambiguous definitions.\n\n\u2075We use $d$ because $C$ often arises from distances in a metric space, and to avoid overloading $c$.\n\nSection 4 recalls the proof of the following theorem, which first appeared in [AWR17].\nTheorem 2.2 (Rounding guarantee, Lemma 7 in [AWR17]). There is an algorithm which takes $\\tilde{x}$\nwith $\\|A\\tilde{x} - b\\|_1 \\leq \\delta$ and produces $\\hat{x}$ in $O(n^2)$ time, with\n\n$$A\\hat{x} = b, \\quad \\|\\tilde{x} - \\hat{x}\\|_1 \\leq 2\\delta.$$\n\nWe now show how the rounding procedure gives a roadmap for our approach. 
Consider the following\n$\\ell_1$ regression objective over the simplex (a similar penalized objective appeared in [She13]):\n\n$$\\min_{x \\in \\Delta^m} d^\\top x + 2\\|d\\|_\\infty \\|Ax - b\\|_1. \\qquad (3)$$\n\nWe show that the penalized objective value is still OPT, and furthermore any approximate minimizer\nyields an approximate transport plan.\nLemma 2.3 (Penalized $\\ell_1$ regression). The value of (3) is OPT. Also, given $\\tilde{x}$, an $\\epsilon$-approximate\nminimizer to (3), we can find an $\\epsilon$-approximate transportation plan $\\hat{x}$ in $O(n^2)$ time.\n\nProof. Recall $\\mathrm{OPT} = \\min_{x \\in \\Delta^m, Ax = b} d^\\top x$. Let $\\tilde{x}$ be the minimizing argument in (3). We claim\nthere is some optimal $\\tilde{x}$ with $A\\tilde{x} = b$; clearly, the first claim is then true. Suppose otherwise, and let\n$\\|A\\tilde{x} - b\\|_1 = \\delta > 0$. Then, let $\\hat{x}$ be the result of the algorithm in Theorem 2.2, applied to $\\tilde{x}$, so that\n$A\\hat{x} = b$, $\\|\\tilde{x} - \\hat{x}\\|_1 \\leq 2\\delta$. We then have\n\n$$d^\\top \\hat{x} + 2\\|d\\|_\\infty \\|A\\hat{x} - b\\|_1 = d^\\top(\\hat{x} - \\tilde{x}) + d^\\top \\tilde{x} \\leq d^\\top \\tilde{x} + \\|d\\|_\\infty \\|\\hat{x} - \\tilde{x}\\|_1 \\leq d^\\top \\tilde{x} + 2\\|d\\|_\\infty \\delta.$$\n\nThe objective value of $\\hat{x}$ is no more than that of $\\tilde{x}$, a contradiction. By this discussion, we can take any\napproximate minimizer to (3) and round it to a transport plan without increasing the objective.\n\nSection 3 proves Theorem 2.4, which says we can efficiently find an approximate minimizer to (3).\nTheorem 2.4 (Approximate $\\ell_1$ regression over the simplex). 
There is an algorithm (Algorithm 1)\ntaking input $\\epsilon$, which has $O((\\|d\\|_\\infty \\log n \\log \\gamma)/\\epsilon)$ parallel depth for $\\gamma = \\log n \\cdot \\|d\\|_\\infty/\\epsilon$, and total\nwork $O(n^2 (\\|d\\|_\\infty \\log n \\log \\gamma)/\\epsilon)$, and obtains $\\tilde{x}$, an $\\epsilon$-additive approximation to the objective in (3).\nWe will approach proving Theorem 2.4 through a primal-dual viewpoint, in light of the following\n(based on the definition of the $\\ell_1$ norm):\n\n$$\\min_{x \\in \\Delta^m} d^\\top x + 2\\|d\\|_\\infty \\|Ax - b\\|_1 = \\min_{x \\in \\Delta^m} \\max_{y \\in [-1, 1]^{2n}} d^\\top x + 2\\|d\\|_\\infty \\left( y^\\top A x - b^\\top y \\right). \\qquad (4)$$\n\nFurther, a low-duality gap pair to (4) yields an approximate minimizer to (3).\nLemma 2.5 (Duality gap to error). Suppose $x, y$ is feasible ($x \\in \\Delta^m$, $y \\in [-1, 1]^{2n}$), and for any\nfeasible $u, v$,\n\n$$\\left( d^\\top x + 2\\|d\\|_\\infty \\left( v^\\top A x - b^\\top v \\right) \\right) - \\left( d^\\top u + 2\\|d\\|_\\infty \\left( y^\\top A u - b^\\top y \\right) \\right) \\leq \\delta.$$\n\nThen, we have $d^\\top x + 2\\|d\\|_\\infty \\|Ax - b\\|_1 \\leq \\delta + \\mathrm{OPT}$.\n\nProof. The result follows from maximizing over $v$, and noting that for the minimizing $u$,\n\n$$d^\\top u + 2\\|d\\|_\\infty \\left( y^\\top A u - b^\\top y \\right) \\leq d^\\top u + 2\\|d\\|_\\infty \\|Au - b\\|_1 = \\mathrm{OPT}.$$\n\nCorrespondingly, Section 3 gives an algorithm which obtains $(x, y)$ with bounded duality gap within\nthe runtime of Theorem 2.4.\n\n2.2 Notation\n$\\mathbb{R}_{\\geq 0}$ is the nonnegative reals. $\\mathbf{1}$ is the all-ones vector of appropriate dimension when clear. The\nprobability simplex is $\\Delta^d \\stackrel{\\mathrm{def}}{=} \\{v \\mid v \\in \\mathbb{R}^d_{\\geq 0}, \\mathbf{1}^\\top v = 1\\}$. 
We say matrix $X$ is in the simplex of\nappropriate dimensions when its (nonnegative) entries sum to one.\n$\\|\\cdot\\|_1$ and $\\|\\cdot\\|_\\infty$ are the $\\ell_1$ and $\\ell_\\infty$ norms, i.e. $\\|v\\|_1 = \\sum_i |v_i|$ and $\\|v\\|_\\infty = \\max_i |v_i|$. When $A$ is\na matrix, we let $\\|A\\|_{p \\to q}$ be the matrix operator norm, i.e. $\\sup_{\\|v\\|_p = 1} \\|Av\\|_q$, where $\\|\\cdot\\|_p$ is the $\\ell_p$\nnorm. In particular, $\\|A\\|_{1 \\to 1}$ is the largest $\\ell_1$ norm of a column of $A$.\nThroughout $\\log$ is the natural logarithm. For $x \\in \\Delta^d$, $h(x) = \\sum_{i \\in [d]} x_i \\log x_i$ is (negative) entropy\nwhere $0 \\log 0 = 0$ by convention. It is well-known that $\\max_{x \\in \\Delta^d} h(x) - \\min_{x \\in \\Delta^d} h(x) = \\log d$.\nWe also use the Bregman divergence of a regularizer and the proximal operator of a divergence.\nDefinition 2.6 (Bregman divergence). For (differentiable) regularizer $r$ and $z, w$ in its domain, the\nBregman divergence from $z$ to $w$ is\n\n$$V^r_z(w) \\stackrel{\\mathrm{def}}{=} r(w) - r(z) - \\langle \\nabla r(z), w - z \\rangle.$$\n\nWhen $r$ is convex, the divergence is nonnegative and convex in the argument ($w$ in the definition).\nDefinition 2.7 (Proximal operator). For (differentiable) regularizer $r$, $z$ in its domain, and $g$ in the\ndual space (when the domain is in $\\mathbb{R}^d$, so is the dual space), we define the proximal operator as\n\n$$\\mathrm{Prox}^r_z(g) \\stackrel{\\mathrm{def}}{=} \\mathrm{argmin}_w \\left\\{ \\langle g, w \\rangle + V^r_z(w) \\right\\}.$$\n\nSeveral variables have specialized meaning throughout. All graphs considered will be on $2n$ vertices\nwith $m$ edges, i.e. $m = n^2$. $A \\in \\mathbb{R}^{2n \\times m}$ is the edge-incidence matrix. $d$ is the vectorized cost\nmatrix $C$. $b$ is the constraint vector, concatenating row and column constraints $r, c$. In algorithms\nfor solving (4), $x$ and $y$ are primal (in a simplex) and dual (in a box) variables respectively. 
In\nSection 3, we adopt the linear programming perspective where the decision variable $x \\in \\Delta^m$ is\na vector. In Section 4, for convenience we take the perspective where $X$ is an unflattened $n \\times n$\nmatrix. $U_{r,c}$ is the feasible polytope: when the domain is vectors, $U_{r,c}$ is $\\{x \\mid Ax = b\\}$, and when it is\nmatrices, $U_{r,c}$ is $\\{X \\mid X\\mathbf{1} = r, X^\\top \\mathbf{1} = c\\}$ (by flattening $X$ this is consistent).\n\n3 Main Algorithm\n\nThis section describes our algorithm for finding a primal-dual pair $(x, y)$ with a small duality gap,\nwith respect to the objective in (4), which we restate here for convenience:\n\n$$\\min_{x \\in \\mathcal{X}} \\max_{y \\in \\mathcal{Y}} d^\\top x + 2\\|d\\|_\\infty \\left( y^\\top A x - b^\\top y \\right), \\quad \\mathcal{X} \\stackrel{\\mathrm{def}}{=} \\Delta^m, \\ \\mathcal{Y} \\stackrel{\\mathrm{def}}{=} [-1, 1]^{2n}. \\qquad \\text{(Restatement of (4))}$$\n\nOur algorithm is a specialization of the algorithm in [She17]. One of our technical contributions in\nthis regard is an analysis of the algorithm which more closely relates it to the analysis of dual extrapolation\n[Nes07], an algorithm for finding approximate saddle points with a more standard analysis.\nIn Section 3.1, we give the algorithmic framework and convergence analysis. In Section B.1, we\nprovide analysis of an alternating minimization scheme for implementing steps of the procedure.\nThe same procedure was used in [She17] which claimed without proof the linear convergence rate\nof the alternating minimization; we hope the analysis will make the method more broadly accessible\nto the optimization community. We defer many proofs to Appendix B.\n\n3.1 Dual Extrapolation Framework\n\nFor an objective $F(x, y)$ convex in $x$ and concave in $y$, the standard way to measure the duality gap is\nto define the gradient operator $g(x, y) = (\\nabla_x F(x, y), -\\nabla_y F(x, y))$, and show that for $z = (x, y)$\nand any $u$ on the product space, the regret, $\\langle g(z), z - u \\rangle$, is small. 
Correspondingly, we define\n\n$$g(x, y) \\stackrel{\\mathrm{def}}{=} \\left( d + 2\\|d\\|_\\infty A^\\top y, \\ 2\\|d\\|_\\infty (b - Ax) \\right).$$\n\nThe dual extrapolation framework [Nes07] requires a regularizer on the product space. The algorithm\nis simple to state; it takes two \u201cmirror descent-like\u201d steps each iteration, maintaining a state\n$s_t$ in the dual space\u2076. A typical setup is a Lipschitz gradient operator and a regularizer which is\nthe sum of canonical strongly-convex regularizers in the norms corresponding to the product space\n$\\mathcal{X}, \\mathcal{Y}$. However, recent works have shown that this setup can be greatly relaxed and still obtain\nsimilar rates of convergence. In particular, [She17] introduced the following definition.\nDefinition 3.1 (Area-convexity). Regularizer $r$ is $\\kappa$-area-convex with respect to operator $g$ if for\nany points $a, b, c$ in its domain,\n\n$$\\kappa \\left( r(a) + r(b) + r(c) - 3r\\left( \\frac{a + b + c}{3} \\right) \\right) \\geq \\langle g(b) - g(a), b - c \\rangle. \\qquad (5)$$\n\nArea-convexity is so named because $\\langle g(b) - g(a), b - c \\rangle$ can be viewed as measuring the \u201carea\u201d of the\ntriangle with vertices $a, b, c$ with respect to some Jacobian matrix. In the case of bilinear objectives,\nthe left hand side in the definition of area-convexity is invariant to permuting $a, b, c$, whereas the\nsign of the right hand side can be flipped by interchanging $a, c$, so area-convexity implies convexity.\nHowever, it does not even imply the regularizer $r$ is strongly-convex, a typical assumption for the\nconvergence of mirror descent methods.\nWe state the algorithm for time horizon $T$; the only difference from [Nes07] is a factor of 2 in\ndefining $s_{t+1}$, i.e. adding a $1/(2\\kappa)$ multiple rather than $1/\\kappa$. 
We find it of interest to explore whether\nthis change is necessary or specific to the analysis of [She17].\n\nAlgorithm 1 $\\bar{w}$ = Dual-Extrapolation($\\kappa$, $r$, $g$, $T$): Dual extrapolation with area-convex $r$.\n\nInitialize $s_0 = 0$, let $\\bar{z}$ be the minimizer of $r$.\nfor $t < T$ do\n  $z_t \\leftarrow \\mathrm{Prox}^r_{\\bar{z}}(s_t)$.\n  $w_t \\leftarrow \\mathrm{Prox}^r_{\\bar{z}}\\left( s_t + \\frac{1}{\\kappa} g(z_t) \\right)$.\n  $s_{t+1} \\leftarrow s_t + \\frac{1}{2\\kappa} g(w_t)$.\n  $t \\leftarrow t + 1$.\nend for\nreturn $\\bar{w} \\stackrel{\\mathrm{def}}{=} \\frac{1}{T} \\sum_{t \\in [T]} w_t$.\n\nLemma 3.2 (Dual extrapolation convergence). Suppose $r$ is $\\kappa$-area-convex with respect to $g$. Further,\nsuppose for some $u$, $\\Theta \\geq r(u) - r(\\bar{z})$. Then, the output $\\bar{w}$ to Algorithm 1 satisfies\n\n$$\\langle g(\\bar{w}), \\bar{w} - u \\rangle \\leq \\frac{2\\kappa\\Theta}{T}.$$\n\nIn fact, by more carefully analyzing the requirements of dual extrapolation we have the following.\nCorollary 3.3. Suppose in Algorithm 1, the proximal steps are implemented with $\\epsilon'/4\\kappa$ additive\nerror. Then, the upper bound of the regret in Lemma 3.2 is $2\\kappa\\Theta/T + \\epsilon'$.\n\nWe now state a useful second-order characterization of area-convexity involving a relationship between\nthe Jacobian of $g$ and the Hessian of $r$, which was proved in [She17].\nTheorem 3.4 (Second-order area-convexity, Theorem 1.6 in [She17]). For bilinear minimax objectives,\ni.e. whose associated operator $g$ has Jacobian\n\n$$J = \\begin{pmatrix} 0 & M^\\top \\\\ -M & 0 \\end{pmatrix},$$\n\nand for twice-differentiable $r$, if for all $z$ in the domain,\n\n$$\\begin{pmatrix} \\kappa \\nabla^2 r(z) & -J \\\\ J & \\kappa \\nabla^2 r(z) \\end{pmatrix} \\succeq 0,$$\n\nthen $r$ is $3\\kappa$-area-convex with respect to $g$.\n\n\u2076In this regard, it is more similar to the \u201cdual averaging\u201d or \u201clazy\u201d mirror descent setup [Bub15].\n\nFinally, we complete the outline of the algorithm by stating the specific regularizer we use, which\nfirst appeared in [She17]. 
We then prove its 3-area-convexity with respect to $g$ by using Theorem 3.4.\n\n$$r(x, y) = 2\\|d\\|_\\infty \\left( 10 \\sum_{j \\in [n]} x_j \\log x_j + x^\\top A^\\top (y^2) \\right), \\qquad (6)$$\n\nwhere $(y^2)$ is entry-wise. To give some motivation for this regularizer, one $\\ell_\\infty$-strongly convex\nregularizer is $\\frac{1}{2}\\|y\\|_2^2$, but over the $\\ell_\\infty$ ball, this regularizer has large range. The term $x^\\top A^\\top (y^2)$ in\n(6) captures the curvature required for strong-convexity locally, but has a smaller range due to the\nrestrictions on $x$, $A$. The constants chosen were the smallest which satisfy the assumptions of the\nfollowing Lemma 3.5.\nLemma 3.5 (Area-convexity of the Sherman regularizer). For the Jacobian $J$ associated with the\nobjective in (4) and the regularizer $r$ defined in (6), we have\n\n$$\\begin{pmatrix} \\nabla^2 r(z) & -J \\\\ J & \\nabla^2 r(z) \\end{pmatrix} \\succeq 0.$$\n\nWe now give the proof of Theorem 2.4, requiring some claims in Appendix B.1 for the complexity\nof Algorithm 1. In particular, Appendix B.1 implies that although the minimizer to the proximal\nsteps cannot be computed in closed form because of non-separability, a simple alternating scheme\nconverges to an approximate-minimizer in near-constant time.\n\nProof of Theorem 2.4. The algorithm is Algorithm 1, using the regularizer $r$ in (6). Clearly, in\nthe feasible region the range of the regularizer is at most $20\\|d\\|_\\infty \\log n + 4\\|d\\|_\\infty$, where the former\nsummand comes from the range of entropy and the latter from $\\|A^\\top\\|_\\infty = 2$. We may choose\n$\\Theta = O(\\|d\\|_\\infty \\log n)$ to be the range of $r$ to satisfy the assumptions of Lemma 3.2, since for all $u$,\n$\\langle \\nabla r(\\bar{z}), \\bar{z} - u \\rangle \\leq 0 \\Rightarrow V^r_{\\bar{z}}(u) \\leq r(u) - r(\\bar{z})$.\nBy Theorem 3.4 and Lemma 3.5, $r$ is 3-area-convex with respect to $g$. By Corollary 3.3, $T = 12\\Theta/\\epsilon$\niterations suffice, implementing each proximal step to $\\epsilon/12$-additive accuracy. Finally, using\nTheorem B.5 to bound this implementation runtime concludes the proof.\n\n4 Rounding to $U_{r,c}$\n\nWe state the rounding procedure in [AWR17] for completeness here, which takes a transport plan $\\tilde{X}$\nclose to $U_{r,c}$ and transforms it into a plan which exactly meets the constraints and is close to $\\tilde{X}$ in\n$\\ell_1$, and then prove its correctness in Appendix C. Throughout $r(X) \\stackrel{\\mathrm{def}}{=} X\\mathbf{1}$, $c(X) \\stackrel{\\mathrm{def}}{=} X^\\top \\mathbf{1}$.\n\nAlgorithm 2 $\\hat{X}$ = Rounding($\\tilde{X}$, $r$, $c$): Rounding to feasible polytope\n\n$X' \\leftarrow \\mathrm{diag}\\left( \\min\\left( \\frac{r}{r(\\tilde{X})}, 1 \\right) \\right) \\tilde{X}$.\n$X'' \\leftarrow X' \\mathrm{diag}\\left( \\min\\left( \\frac{c}{c(X')}, 1 \\right) \\right)$.\n$e_r \\leftarrow r - r(X'')$, $e_c \\leftarrow c - c(X'')$, $E \\leftarrow \\mathbf{1}^\\top e_r$.\n$\\hat{X} \\leftarrow X'' + \\frac{1}{E} e_r e_c^\\top$.\nreturn $\\hat{X}$.\n\n5 Experiments\n\nWe show experiments illustrating the potential of our algorithm to be useful in practice, by considering\nits performance on computing optimal transport distances on the MNIST dataset and comparing\nagainst algorithms in the literature including APDAMD [LHJ19] and Sinkhorn iteration. 
All com-\nparisons are based on the number of matrix-vector multiplications (rather than iterations, due to our\nalgorithm\u2019s alternating subroutine), the main computational component of all algorithms considered.\n\n8\n\n\f(a) Comparison with Sinkhorn iteration with different\nparameters.\n\n(b) Comparison with APDAMD [LHJ19] with differ-\nent parameters.\n\n(a) Comparison with Sinkhorn iteration on 20 ran-\ndomly chosen MNIST digit pairs.\n\n(b) Comparison with APDAMD [LHJ19] on 20 ran-\ndomly chosen MNIST digit pairs.\n\nWhile our unoptimized algorithm performs poorly, slightly optimizing the size of the regularizer and\nstep sizes used results in an algorithm with competitive performance to APDAMD, the \ufb01rst-order\nmethod with the best provable guarantees and observed practical performance. Sinkhorn iteration\noutperformed all \ufb01rst-order methods experimentally; however, an optimized version of our algorithm\nperformed better than conservatively-regularized Sinkhorn iteration, and was more competitive with\nvariants of Sinkhorn found in practice than other \ufb01rst-order methods.\nAs we discuss in our implementation details (Appendix D), we acknowledge that implementations\nof our algorithm illustrated are not the same as those with provable guarantees in our paper. How-\never, we believe that our modi\ufb01cations are justi\ufb01able in theory, and consistent with those made in\npractice to existing algorithms. 
Further, we hope that studying the modi\ufb01cations we made (step\nsize, using mirror prox [Nem04] for stability considerations), as well as the consideration of other\nnumerical speedups such as greedy updates [AWR17] or kernel approximations [ABRW18], will\nbecome fruitful for understanding the potential of accelerated \ufb01rst-order methods in both the theory\nand practice of computational optimal transport.\n\n9\n\n\fAcknowledgements\n\nWe thank Jose Blanchet and Carson Kent\nported by NSF Graduate Fellowship DGE-114747.\nREER Award CCF-1844855.\n1656518.\n\nfor helpful conversations.\n\nsup-\nAS was supported by NSF CA-\nKT was supported by NSF Graduate Fellowship DGE-\n\nAJ was\n\nReferences\n[ABRW18] Jason Altschuler, Francis Bach, Alessandro Rudi, and Jonathan Weed. Approximating\nthe quadratic transportation metric in near-linear time. CoRR, abs/1810.10046, 2018.\n1.2, 5, D\n\n[ACB17] Mart\u00b4\u0131n Arjovsky, Soumith Chintala, and L\u00b4eon Bottou. Wasserstein generative adver-\nIn Proceedings of the 34th International Conference on Machine\nsarial networks.\nLearning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 214\u2013223,\n2017. 1\n\n[ANOY14] Alexandr Andoni, Aleksandar Nikolov, Krzysztof Onak, and Grigory Yaroslavtsev.\nParallel algorithms for geometric graph problems. In Symposium on Theory of Com-\nputing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 574\u2013583,\n2014. 1.2\n\n[AO15] Zeyuan Allen Zhu and Lorenzo Orecchia. Nearly-linear time positive LP solver with\nfaster convergence rate. In Proceedings of the Forty-Seventh Annual ACM on Sym-\nposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015,\npages 229\u2013236, 2015. 1.2\n\n[AS14] Pankaj K. Agarwal and R. Sharathkumar. Approximation algorithms for bipartite\nmatching with metric and geometric costs. In Symposium on Theory of Computing,\nSTOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 555\u2013564, 2014. 
\n\n[AWR17] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1961\u20131971, 2017.\n\n[BJKS18] Jose Blanchet, Arun Jambulapati, Carson Kent, and Aaron Sidford. Towards optimal running times for optimal transport. CoRR, abs/1810.07717, 2018.\n\n[BK17] Jose H. Blanchet and Yang Kang. Distributionally robust groupwise regularization estimator. In Proceedings of The 9th Asian Conference on Machine Learning, ACML 2017, Seoul, Korea, November 15-17, 2017, pages 97\u2013112, 2017.\n\n[Bub15] S\u00e9bastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231\u2013357, 2015.\n\n[BvdPPH11] Nicolas Bonneel, Michiel van de Panne, Sylvain Paris, and Wolfgang Heidrich. Displacement interpolation using Lagrangian mass transport. ACM Trans. Graph., 30(6):158:1\u2013158:12, 2011.\n\n[CK18] Deeparnab Chakrabarty and Sanjeev Khanna. Better and simpler error analysis of the Sinkhorn-Knopp algorithm for matrix scaling. In 1st Symposium on Simplicity in Algorithms, SOSA 2018, January 7-10, 2018, New Orleans, LA, USA, pages 4:1\u20134:11, 2018.\n\n[CMTV17] Michael B. Cohen, Aleksander Madry, Dimitris Tsipras, and Adrian Vladu. Matrix scaling and balancing via box constrained Newton\u2019s method and interior point methods. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 902\u2013913, 2017.\n\n[Cut13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. 
In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 2292\u20132300, 2013.\n\n[DGK18] Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn\u2019s algorithm. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsm\u00e4ssan, Stockholm, Sweden, July 10-15, 2018, pages 1366\u20131375, 2018.\n\n[DO19] Jelena Diakonikolas and Lorenzo Orecchia. The approximate duality gap technique: A unified theory of first-order methods. SIAM Journal on Optimization, 29(1):660\u2013689, 2019.\n\n[EK18] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program., 171(1-2):115\u2013166, 2018.\n\n[GCPB16] Aude Genevay, Marco Cuturi, Gabriel Peyr\u00e9, and Francis R. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3432\u20133440, 2016.\n\n[HK73] John E. Hopcroft and Richard M. Karp. An $n^{5/2}$ algorithm for maximum matchings in bipartite graphs. SIAM J. Comput., 2(4):225\u2013231, 1973.\n\n[KLOS14] Jonathan A. Kelner, Yin Tat Lee, Lorenzo Orecchia, and Aaron Sidford. An almost-linear-time algorithm for approximate max flow in undirected graphs, and its multicommodity generalizations. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 217\u2013226, 2014. 
\n\n[LHJ19] Tianyi Lin, Nhat Ho, and Michael I. Jordan. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. CoRR, abs/1901.06482, 2019.\n\n[LMR19] Nathaniel Lahn, Deepika Mulchandani, and Sharath Raghvendra. A graph theoretic additive approximation of optimal transport. CoRR, abs/1905.11830, 2019.\n\n[LS14] Yin Tat Lee and Aaron Sidford. Path finding methods for linear programming: Solving linear programs in $\\tilde{O}(\\sqrt{\\mathrm{rank}})$ iterations and faster algorithms for maximum flow. In 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, pages 424\u2013433, 2014.\n\n[LS15] Yin Tat Lee and Aaron Sidford. Efficient inverse maintenance and faster algorithms for linear programming. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 230\u2013249, 2015.\n\n[Nem04] Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229\u2013251, 2004.\n\n[Nes05] Yurii Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127\u2013152, 2005.\n\n[Nes07] Yurii Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program., 109(2-3):319\u2013344, 2007.\n\n[PZ16] Victor M. Panaretos and Yoav Zemel. Amplitude and phase variation of point processes. Annals of Statistics, 44(2):771\u2013812, 2016.\n\n[Qua19] Kent Quanrud. Approximating optimal transport with linear programs. In 2nd Symposium on Simplicity in Algorithms, SOSA@SODA 2019, January 8-9, 2019 - San Diego, CA, USA, pages 6:1\u20136:9, 2019. 
\n\n[SA12] R. Sharathkumar and Pankaj K. Agarwal. A near-linear time $\\epsilon$-approximation algorithm for geometric bipartite matching. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC 2012, New York, NY, USA, May 19 - 22, 2012, pages 385\u2013394, 2012.\n\n[San09] Piotr Sankowski. Maximum weight bipartite matching in matrix multiplication time. Theor. Comput. Sci., 410(44):4480\u20134488, 2009.\n\n[SdGP+15] Justin Solomon, Fernando de Goes, Gabriel Peyr\u00e9, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas J. Guibas. Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Trans. Graph., 34(4):66:1\u201366:11, 2015.\n\n[She13] Jonah Sherman. Nearly maximum flows in nearly linear time. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26-29 October, 2013, Berkeley, CA, USA, pages 263\u2013269, 2013.\n\n[She17] Jonah Sherman. Area-convexity, $\\ell_\\infty$ regularization, and undirected multicommodity flow. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 452\u2013460, 2017.\n\n[ST18] Aaron Sidford and Kevin Tian. Coordinate methods for accelerating $\\ell_\\infty$ regression and faster approximate maximum flow. In 59th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2018, 7-9 October, 2018, Paris, France, 2018.\n\n[You01] Neal E. Young. Sequential and parallel algorithms for mixed packing and covering. In 42nd Annual Symposium on Foundations of Computer Science, FOCS 2001, 14-17 October 2001, Las Vegas, Nevada, USA, pages 538\u2013546, 2001.\n\n[ZLdOW17] Zeyuan Allen Zhu, Yuanzhi Li, Rafael Mendes de Oliveira, and Avi Wigderson. Much faster algorithms for matrix scaling. 
In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 890\u2013901, 2017.", "award": [], "sourceid": 6067, "authors": [{"given_name": "Arun", "family_name": "Jambulapati", "institution": "Stanford University"}, {"given_name": "Aaron", "family_name": "Sidford", "institution": "Stanford"}, {"given_name": "Kevin", "family_name": "Tian", "institution": "Stanford University"}]}