{"title": "Computing Kantorovich-Wasserstein Distances on $d$-dimensional histograms using $(d+1)$-partite graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 5793, "page_last": 5803, "abstract": "This paper presents a novel method to compute the exact Kantorovich-Wasserstein distance between a pair of $d$-dimensional histograms having $n$ bins each. We prove that this problem is equivalent to an uncapacitated minimum cost flow problem on a $(d+1)$-partite graph with $(d+1)n$ nodes and $dn^{\\frac{d+1}{d}}$ arcs, whenever the cost is separable along the principal $d$-dimensional directions. We show numerically the benefits of our approach by computing the Kantorovich-Wasserstein distance of order 2 among two sets of instances: gray scale images and $d$-dimensional biomedical histograms. On these types of instances, our approach is competitive with state-of-the-art optimal transport algorithms.", "full_text": "Computing Kantorovich-Wasserstein Distances on\n\nd-dimensional histograms using (d + 1)-partite graphs\n\nGennaro Auricchio, Stefano Gualandi, Marco Veneroni\n\nUniversit\u00e0 degli Studi di Pavia, Dipartimento di Matematica \u201cF. Casorati\"\n\ngennaro.auricchio01@universitadipavia.it,\n\nstefano.gualandi@unipv.it, marco.veneroni@unipv.it\n\nFederico Bassetti\n\nPolitecnico di Milano, Dipartimento di Matematica\n\nfederico.bassetti@polimi.it\n\nAbstract\n\nThis paper presents a novel method to compute the exact Kantorovich-Wasserstein\ndistance between a pair of d-dimensional histograms having n bins each. We prove\nthat this problem is equivalent to an uncapacitated minimum cost \ufb02ow problem on\na (d + 1)-partite graph with (d + 1)n nodes and dn\nd arcs, whenever the cost\nis separable along the principal d-dimensional directions. 
We show numerically the benefits of our approach by computing the Kantorovich-Wasserstein distance of order 2 on two sets of instances: gray scale images and d-dimensional biomedical histograms. On these types of instances, our approach is competitive with state-of-the-art optimal transport algorithms.\n\n1 Introduction\n\nThe computation of a measure of similarity (or dissimilarity) between pairs of objects is a crucial subproblem in several applications in Computer Vision [24, 25, 22], Computational Statistics [17], Probability [6, 8], and Machine Learning [29, 12, 14, 5]. In mathematical terms, in order to compute the similarity between a pair of objects, we want to compute a distance. If the distance is equal to zero, the two objects are considered to be equal; the more different the two objects are, the greater their distance. For instance, the Euclidean norm is the most commonly used distance function to compare a pair of points in $R^d$. Note that the Euclidean distance requires only O(d) operations to be computed. When computing the distance between complex discrete objects, such as a pair of discrete measures, a pair of images, a pair of d-dimensional histograms, or a pair of clouds of points, the Kantorovich-Wasserstein distance [31, 30] has proved to be a relevant distance function [24], which has both nice mathematical properties and useful practical implications. Unfortunately, computing the Kantorovich-Wasserstein distance requires the solution of an optimization problem. Even if the optimization problem is polynomially solvable, the size of practical instances to be solved is very large, and hence the computation of Kantorovich-Wasserstein distances entails a significant computational burden.\n\nThe optimization problem that yields the Kantorovich-Wasserstein distance can be solved with different methods. 
Nowadays, the most popular methods are based on (i) Sinkhorn\u2019s algorithm [11, 28, 3], which solves (heuristically) a regularized version of the basic optimal transport problem, and (ii) Linear Programming-based algorithms [13, 15, 20], which exactly solve the basic optimal transport problem by formulating and solving an equivalent uncapacitated minimum cost flow problem. For a nice overview of both computational approaches, we refer the reader to Chapters 2 and 3 in [23], and the references contained therein.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this paper, we propose a Linear Programming-based method to speed up the computation of Kantorovich-Wasserstein distances of order 2, which exploits the structure of the ground distance to formulate an uncapacitated minimum cost flow problem. The flow problem is then solved with a state-of-the-art implementation of the well-known Network Simplex algorithm [16].\n\nOur approach is along the line of research initiated in [19], where the authors proposed a very efficient method to compute Kantorovich-Wasserstein distances of order 1 (i.e., the so\u2013called Earth Mover Distance), whenever the ground distance between a pair of points is the $\\ell_1$ norm. In [19], the structure of the $\\ell_1$ ground distance and of regular d-dimensional histograms is exploited to define a very small flow network. More recently, this approach has been successfully generalized in [7] to the case of $\\ell_\\infty$ and $\\ell_2$ norms, providing both exact and approximation algorithms, which are able to compute distances between pairs of 512 \u00d7 512 gray scale images. 
The idea of speeding up the computation of Kantorovich-Wasserstein distances by defining a minimum cost flow on smaller structured flow networks is also used in [22], where a truncated distance is used as ground distance in place of an $\\ell_p$ norm.\n\nThe outline of this paper is as follows. Section 2 reviews the basic notions of discrete optimal transport and fixes the notation. Section 3 contains our main contribution, that is, Theorem 1 and Corollary 1, which permit speeding up the computation of Kantorovich-Wasserstein distances of order 2 under quite general assumptions. Section 4 presents numerical results of our approaches, compared with Sinkhorn\u2019s algorithm as implemented in [11] and a standard Linear Programming formulation on a complete bipartite graph [24]. Finally, Section 5 concludes the paper.\n\n2 Discrete Optimal Transport: an Overview\n\nLet X and Y be two discrete spaces. Given two probability vectors \u00b5 and \u03bd defined on X and Y, respectively, and a cost c : X \u00d7 Y \u2192 $R_+$, the Kantorovich-Rubinshtein functional between \u00b5 and \u03bd is defined as\n\n$$W_c(\\mu, \\nu) = \\inf_{\\pi \\in \\Pi(\\mu,\\nu)} \\sum_{(x,y) \\in X \\times Y} c(x, y)\\, \\pi(x, y) \\qquad (1)$$\n\nwhere \u03a0(\u00b5, \u03bd) is the set of all the probability measures on X \u00d7 Y with marginals \u00b5 and \u03bd, i.e., the probability measures \u03c0 such that $\\sum_{y \\in Y} \\pi(x, y) = \\mu(x)$ and $\\sum_{x \\in X} \\pi(x, y) = \\nu(y)$, for every (x, y) in X \u00d7 Y. Such probability measures are sometimes called transport plans or couplings for \u00b5 and \u03bd. An important special case is when X = Y and the cost function c is a distance on X. 
In this case, $W_c$ is a distance on the simplex of probability vectors on X, also known as the Kantorovich-Wasserstein distance of order 1.\n\nWe remark that the Kantorovich-Wasserstein distance of order p can be defined, more generally, for arbitrary probability measures on a metric space (X, \u03b4) by\n\n$$W_p(\\mu, \\nu) := \\left( \\inf_{\\pi \\in \\Pi(\\mu,\\nu)} \\int_{X \\times X} \\delta^p(x, y)\\, \\pi(dx\\,dy) \\right)^{\\min(1/p,\\,1)} \\qquad (2)$$\n\nwhere now \u03a0(\u00b5, \u03bd) is the set of all probability measures on the Borel sets of X \u00d7 X that have marginals \u00b5 and \u03bd, see, e.g., [4]. The infimum in (2) is attained, and any probability \u03c0 which realizes the minimum is called an optimal transport plan.\n\nThe Kantorovich-Rubinshtein transport problem in the discrete setting can be seen as a special case of the following Linear Programming problem, where we assume now that \u00b5 and \u03bd are generic vectors of dimension n, with positive components:\n\n$$(P) \\quad \\min \\sum_{x \\in X} \\sum_{y \\in Y} c(x, y)\\, \\pi(x, y) \\qquad (3)$$\n$$\\text{s.t.} \\quad \\sum_{y \\in Y} \\pi(x, y) \\le \\mu(x) \\quad \\forall x \\in X \\qquad (4)$$\n$$\\sum_{x \\in X} \\pi(x, y) \\ge \\nu(y) \\quad \\forall y \\in Y \\qquad (5)$$\n$$\\pi(x, y) \\ge 0. \\qquad (6)$$\n\nFigure 1: (a) Two given 2-dimensional histograms of size N \u00d7 N, with N = 3; (b) Complete bipartite graph with $N^4$ arcs; (c) 3-partite graph with $dN^3$ arcs.\n\nIf $\\sum_x \\mu(x) = \\sum_y \\nu(y)$ we have the so-called balanced transportation problem, otherwise the transportation problem is said to be unbalanced [18, 10]. For balanced optimal transport problems, constraints (4) and (5) must be satisfied with equality, and the problem reduces to the Kantorovich transport problem (up to normalization of the vectors \u00b5 and \u03bd).\n\nProblem (P) is related to the so-called Earth Mover\u2019s distance. 
In this case, X, Y \u2282 $R^d$, x and y are the centers of two data clusters, and \u00b5(x) and \u03bd(y) give the number of points in the respective cluster. Finally, c(x, y) is some measure of dissimilarity between the two clusters x and y. Once the optimal transport \u03c0* is determined, the Earth Mover\u2019s distance between \u00b5 and \u03bd is defined as (e.g., see [24])\n\n$$EMD(\\mu, \\nu) = \\frac{\\sum_{x \\in X} \\sum_{y \\in Y} c(x, y)\\, \\pi^*(x, y)}{\\sum_{x \\in X} \\sum_{y \\in Y} \\pi^*(x, y)}.$$\n\nProblem (P) can be formulated as an uncapacitated minimum cost flow problem on a bipartite graph defined as follows [2]. The bipartite graph has two partitions of nodes: the first partition has a node for each point x of X, and the second partition has a node for each point y of Y. Each node x of the first partition has a supply of mass equal to \u00b5(x), and each node y of the second partition has a demand of \u03bd(y) units of mass. The bipartite graph has an (uncapacitated) arc for each element in the Cartesian product X \u00d7 Y, having cost equal to c(x, y). The minimum cost flow problem defined on this graph yields the optimal transport plan \u03c0*(x, y), which indeed is an optimal solution of problem (3)\u2013(6). For instance, in the case of a regular 2-dimensional histogram of size N \u00d7 N, that is, having $n = N^2$ bins, we get a bipartite graph with $2N^2$ nodes and $N^4$ arcs (or 2n nodes and $n^2$ arcs). 
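As a concrete illustration of problem (P) and its bipartite formulation, the following sketch (plain Python; the function names are ours, and the brute-force enumeration is only meant for toy instances, not the Network Simplex used in our experiments) computes the exact optimal value for small integer marginals. Enumeration is exact here because the transportation polytope with integer marginals always admits an integral optimal vertex.

```python
def enumerate_rows(mass, caps):
    """All ways to split `mass` units over bins with remaining capacities `caps`."""
    if len(caps) == 1:
        if mass <= caps[0]:
            yield (mass,)
        return
    for x in range(min(mass, caps[0]) + 1):
        for rest in enumerate_rows(mass - x, caps[1:]):
            yield (x,) + rest

def min_transport_cost(mu, nu, cost):
    """Exact optimal transport cost between integer histograms mu and nu
    (same total mass) by brute-force enumeration of integer transport plans."""
    assert sum(mu) == sum(nu)
    best = float("inf")

    def rec(i, caps, acc):
        nonlocal best
        if acc >= best:          # prune partial plans that are already worse
            return
        if i == len(mu):
            best = acc           # caps are all zero here since total masses match
            return
        for row in enumerate_rows(mu[i], caps):
            extra = sum(f * c for f, c in zip(row, cost[i]))
            rec(i + 1, [c - f for c, f in zip(caps, row)], acc + extra)

    rec(0, list(nu), 0)
    return best

# Bins of a 2x2 grid, with squared Euclidean ground cost between bin centers:
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
cost = [[(xa - xb) ** 2 + (ya - yb) ** 2 for (xb, yb) in points]
        for (xa, ya) in points]
mu = [1, 1, 0, 0]   # mass on the first grid row
nu = [0, 0, 1, 1]   # mass on the second grid row
print(min_transport_cost(mu, nu, cost))  # 2
```

On this instance, shifting each of the two units of mass down by one row costs 1 per unit, so the optimum is 2; the cross assignment would cost 4.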
Figure 1\u2013(a) shows an example for a 3 \u00d7 3 histogram, and Figure 1\u2013(b) gives the corresponding complete bipartite graph.\n\nIn this paper, we focus on the case p = 2 in equation (2), where the ground distance function \u03b4 is the Euclidean norm $\\ell_2$; that is, the Kantorovich-Wasserstein distance of order 2, which is denoted by $W_2$. We provide, in the next section, an equivalent formulation on a smaller (d + 1)-partite graph.\n\n3 Formulation on (d + 1)-partite Graphs\n\nFor the sake of clarity, but without loss of generality, we present first our construction considering 2-dimensional histograms and the $\\ell_2$ Euclidean ground distance. Then, we discuss how our construction can be generalized to any pair of d-dimensional histograms.\n\nLet us consider the following flow problem: let \u00b5 and \u03bd be two probability measures over an N \u00d7 N regular grid denoted by G. In the following paragraphs, we use the notation sketched in Figure 2. In addition, we define the set U := {1, . . . , N}.\n\nFigure 2: Basic notation used in Section 3: in order to send a unit of flow from point (a, j) to point (i, b), we either send a unit of flow directly along arc ((a, j), (i, b)) of cost $c((a, j), (i, b)) = (a - i)^2 + (j - b)^2$, or we first send a unit of flow from (a, j) to (i, j), and then from (i, j) to (i, b), with total cost $c((a, j), (i, j)) + c((i, j), (i, b)) = (a - i)^2 + (j - j)^2 + (i - i)^2 + (j - b)^2 = (a - i)^2 + (j - b)^2 = c((a, j), (i, b))$. 
Indeed, the costs of the two different paths are exactly the same.\n\nSince we are considering the $\\ell_2$ norm as ground distance, we minimize the functional\n\n$$R : (F_1, F_2) \\mapsto \\sum_{i,j=1}^{N} \\left[ \\sum_{a=1}^{N} (a - i)^2 f^{(1)}_{a,i,j} + \\sum_{b=1}^{N} (j - b)^2 f^{(2)}_{i,j,b} \\right] \\qquad (7)$$\n\namong all $F_i = \\{f^{(i)}_{a,b,c}\\}$, with a, b, c \u2208 {1, ..., N}, real numbers (i.e., flow variables) satisfying the following constraints:\n\n$$\\sum_{i=1}^{N} f^{(1)}_{a,i,j} = \\mu_{a,j}, \\quad \\forall (a, j) \\in U \\times U \\qquad (8)$$\n$$\\sum_{j=1}^{N} f^{(2)}_{i,j,b} = \\nu_{i,b}, \\quad \\forall (i, b) \\in U \\times U \\qquad (9)$$\n$$\\sum_{a} f^{(1)}_{a,i,j} = \\sum_{b} f^{(2)}_{i,j,b}, \\quad \\forall (i, j) \\in U \\times U. \\qquad (10)$$\n\nConstraints (8) impose that the mass $\\mu_{a,j}$ at the point (a, j) is moved to the points $(k, j)_{k=1,...,N}$. Constraints (9) force the point (i, b) to receive from the points $(i, l)_{l=1,...,N}$ a total mass of $\\nu_{i,b}$. Constraints (10) require that all the mass that goes from the points $(a, j)_{a=1,...,N}$ to the point (i, j) is moved to the points $(i, b)_{b=1,...,N}$. We call a pair $(F_1, F_2)$ satisfying the constraints (8)\u2013(10) a feasible flow between \u00b5 and \u03bd. We denote by F(\u00b5, \u03bd) the set of all feasible flows between \u00b5 and \u03bd.\n\nIndeed, we can formulate the minimization problem defined by (7)\u2013(10) as an uncapacitated minimum cost flow problem on a tripartite graph T = (V, A). The set of nodes of T is $V := V^{(1)} \\cup V^{(2)} \\cup V^{(3)}$, where $V^{(1)}$, $V^{(2)}$ and $V^{(3)}$ are the nodes corresponding to three N \u00d7 N regular grids. We denote by $(i, j)^{(l)}$ the node of coordinates (i, j) in the grid $V^{(l)}$. 
We define the two disjoint sets of arcs between the successive pairs of node partitions as\n\n$$A^{(1)} := \\{((a, j)^{(1)}, (i, j)^{(2)}) \\mid a, i, j \\in U\\}, \\qquad (11)$$\n$$A^{(2)} := \\{((i, j)^{(2)}, (i, b)^{(3)}) \\mid i, j, b \\in U\\}, \\qquad (12)$$\n\nand, hence, the arcs of T are $A := A^{(1)} \\cup A^{(2)}$. Note that in this case the graph T has $3N^2$ nodes and $2N^3$ arcs. Whenever $(F_1, F_2)$ is a feasible flow between \u00b5 and \u03bd, we can think of the values $f^{(1)}_{a,i,j}$ as the quantity of mass that travels from (a, j) to (i, j) or, equivalently, that moves along the arc $((a, j)^{(1)}, (i, j)^{(2)})$ of the tripartite graph, while the values $f^{(2)}_{i,j,b}$ give the mass moving along the arc $((i, j)^{(2)}, (i, b)^{(3)})$ (e.g., see Figures 1\u2013(c) and 2).\n\nNow we can give an idea of the roles of the sets $V^{(1)}$, $V^{(2)}$ and $V^{(3)}$: $V^{(1)}$ is the node set on which the initial distribution \u00b5 is drawn, while the final configuration \u03bd of the mass is drawn on $V^{(3)}$. The node set $V^{(2)}$ is an auxiliary grid that hosts an intermediate configuration between \u00b5 and \u03bd.\n\nWe are now ready to state our main contribution.\n\nTheorem 1. For each measure \u03c0 on G \u00d7 G that transports \u00b5 into \u03bd, we can find a feasible flow $(F_1, F_2)$ such that\n\n$$R(F_1, F_2) = \\sum_{((a,j),(i,b))} ((a - i)^2 + (b - j)^2)\\, \\pi_{((a,j),(i,b))}. \\qquad (13)$$\n\nProof. (Sketch). We will only show how to build a feasible flow starting from a transport plan; the inverse construction uses a more technical lemma (the so\u2013called gluing lemma [4, 31]) and can be found in the Additional Material. 
Let \u03c0 be a transport plan. If we write explicitly the ground cost $\\ell_2^2((a, j), (i, b))$, we find that\n\n$$\\sum_{((a,j),(i,b))} \\ell_2^2((a, j), (i, b))\\, \\pi_{((a,j),(i,b))} = \\sum_{((a,j),(i,b))} ((a - i)^2 + (j - b)^2)\\, \\pi_{((a,j),(i,b))}.$$\n\nIf we set $f^{(1)}_{a,i,j} = \\sum_b \\pi_{((a,j),(i,b))}$ and $f^{(2)}_{i,j,b} = \\sum_a \\pi_{((a,j),(i,b))}$, we find\n\n$$\\sum_{((a,j),(i,b))} \\ell_2^2((a, j), (i, b))\\, \\pi_{((a,j),(i,b))} = \\sum_{i,j} \\left[ \\sum_a (a - i)^2 f^{(1)}_{a,i,j} + \\sum_b (j - b)^2 f^{(2)}_{i,j,b} \\right].$$\n\nIn order to conclude, we have to prove that those $f^{(1)}_{a,i,j}$ and $f^{(2)}_{i,j,b}$ satisfy the constraints (8)\u2013(10). By definition we have\n\n$$\\sum_i f^{(1)}_{a,i,j} = \\sum_i \\sum_b \\pi_{((a,j),(i,b))} = \\mu_{a,j},$$\n\nthus proving (8); similarly, it is possible to check constraint (9). Constraint (10) also follows easily, since\n\n$$\\sum_a f^{(1)}_{a,i,j} = \\sum_a \\sum_b \\pi_{((a,j),(i,b))} = \\sum_b f^{(2)}_{i,j,b}.$$\n\nAs a straightforward, yet fundamental, consequence we have the following result.\n\nCorollary 1. If we set $c((a, j), (i, b)) = (a - i)^2 + (j - b)^2$ then, for any discrete measures \u00b5 and \u03bd, we have that\n\n$$W_2^2(\\mu, \\nu) = \\min_{F(\\mu,\\nu)} R(F_1, F_2). \\qquad (14)$$\n\nIndeed, we can compute the Kantorovich-Wasserstein distance of order 2 between a pair of discrete measures \u00b5, \u03bd by solving an uncapacitated minimum cost flow problem on the given tripartite graph $T := (V^{(1)} \\cup V^{(2)} \\cup V^{(3)}, A^{(1)} \\cup A^{(2)})$.\n\nWe remark that our approach is very general and can be directly extended to deal with the following generalizations.\n\nMore general cost functions. 
The structure of the Euclidean distance $\\ell_2$ that we have exploited is present in any cost function c : G \u00d7 G \u2192 [0, \u221e] that is separable, i.e., that has the form\n\n$$c(x, y) = c^{(1)}(x_1, y_1) + c^{(2)}(x_2, y_2),$$\n\nwhere both $c^{(1)}$ and $c^{(2)}$ are positive real valued functions defined over G. We remark that the whole class of costs $c_p(x, y) = (x_1 - y_1)^p + (x_2 - y_2)^p$ is of this kind, so we can compute any of the Kantorovich-Wasserstein distances related to each $c_p$.\n\nHigher dimensional grids. Our approach can handle discrete measures in spaces of any dimension d, that is, for instance, any d-dimensional histogram. In dimension d = 2, we get a tripartite graph because we decomposed the transport along the two main directions. If we have a problem in dimension d, we need d + 1 grids connected by arcs oriented along the d fundamental directions, yielding a (d + 1)-partite graph. As the dimension d grows, our approach gets faster and more memory efficient than the standard formulation given on a bipartite graph.\n\nIn the Additional Material, we present a generalization of Theorem 1 to any dimension d and to separable cost functions c(x, y).\n\nFigure 3: DOTmark benchmark: Classic, Microscopy, and Shapes images.\n\n4 Computational Results\n\nIn this section, we report the results obtained on two different sets of instances. The goal of our experiments is to show how our approach scales with the size of the histogram N and with the dimension of the histogram d. As ground cost c(x, y), with x, y \u2208 $R^d$, we use the squared $\\ell_2$ norm. As problem instances, we use the gray scale images (i.e., 2-dimensional histograms) proposed by the DOTmark benchmark [26], and a set of d-dimensional histograms obtained from biomedical data measured by flow cytometers [9].\n\nImplementation details. 
We ran our experiments using the Network Simplex as implemented in the Lemon C++ graph library1, since it provides the fastest implementation of the Network Simplex algorithm for solving uncapacitated minimum cost flow problems [16]. We did try other state-of-the-art implementations of combinatorial algorithms for solving min cost flow problems, but the Network Simplex of the Lemon graph library was the fastest by a large margin. The tests were executed on a gaming laptop with Windows 10 (64 bit), equipped with an Intel i7-6700HQ CPU and 16 GB of RAM. The code was compiled with MS Visual Studio 2017, using the ANSI standard C++17. The code execution is single threaded. The Matlab implementation of Sinkhorn\u2019s algorithm [11] runs in parallel on the CPU cores, but we do not use any GPU in our tests. The C++ and Matlab code we used for this paper is freely available at http://stegua.github.io/dpartion-nips2018.\n\nResults for the DOTmark benchmark. The DOTmark benchmark contains 10 classes of gray scale images related to randomly generated images, classical images, and real data from microscopy images of mitochondria [26]. In each class there are 10 different images. Every image is given in the data set at the following pixel resolutions: 32 \u00d7 32, 64 \u00d7 64, 128 \u00d7 128, 256 \u00d7 256, and 512 \u00d7 512. The images in Figure 3 are respectively the ClassicImages, Microscopy, and Shapes images (one class for each row), shown at the highest resolution.\n\nIn our tests, we first compared five approaches to compute the Kantorovich-Wasserstein distances on images of size 32 \u00d7 32:\n\n1. EMD: The implementation of the Transportation Simplex provided by [24], known in the literature as EMD code, which is an exact general method to solve the optimal transport problem. We used the implementation in the programming language C, as provided by the authors, and compiled with all the compiler optimization flags active.\n\n2. 
Sinkhorn: The Matlab implementation of Sinkhorn\u2019s algorithm2 [11], which is an approximate approach whose performance, in terms of speed and numerical accuracy, depends on a parameter \u03bb: for smaller values of \u03bb, the algorithm is faster, but the solution value has a large gap with respect to the optimal value of the transportation problem; for larger values of \u03bb, the algorithm is more accurate (i.e., has a smaller gap), but it becomes slower. Unfortunately, for very large values of \u03bb the method becomes numerically unstable. The best value of \u03bb is very problem dependent. In our tests, we used \u03bb = 1 and \u03bb = 1.5. The second value, \u03bb = 1.5, is the largest value we found for which the algorithm computes the distances for all the instances considered without facing numerical issues.\n\n1http://lemon.cs.elte.hu (last visited on October 26th, 2018)\n2http://marcocuturi.net/SI.html (last visited on October 26th, 2018)\n\n3. Improved Sinkhorn: We implemented in Matlab an improved version of Sinkhorn\u2019s algorithm, specialized to compute distances over regular 2-dimensional grids [28, 27]. The main idea is to improve the matrix-vector operations that are the true computational bottleneck of Sinkhorn\u2019s algorithm, by exploiting the structure of the cost matrix. Indeed, there is a parallel between our approach and the method presented in [28], since both exploit the geometric structure of the cost. In [28], the authors propose a general method that exploits a heat kernel to speed up the matrix-vector products. When the discrete measures are defined over a regular 2-dimensional grid, the cost matrix used by Sinkhorn\u2019s algorithm can be obtained as a Kronecker product of two smaller matrices. 
Hence, instead of performing a matrix-vector product using a matrix of dimension N \u00d7 N, we perform two matrix-matrix products over matrices of dimension $\\sqrt{N} \\times \\sqrt{N}$, yielding a significant runtime improvement. In addition, since the smaller matrices are Toeplitz matrices, they can be embedded into circulant matrices, and, as a consequence, it is possible to employ a Fast Fourier Transform approach to further speed up the computation. Unfortunately, the Fast Fourier Transform makes the approach even more numerically unstable, and we did not use it in our final implementation.\n\n4. Bipartite: The bipartite formulation presented in Figure 1\u2013(b), which is the same as [24], but solved with the Network Simplex implemented in the Lemon Graph library [16].\n\n5. 3-partite: The 3-partite formulation proposed in this paper, which for 2-dimensional histograms is represented in Figure 1\u2013(c). Again, we use the Network Simplex of the Lemon Graph Library to solve the corresponding uncapacitated minimum cost flow problem.\n\nTables 1(a) and 1(b) report the averages of our computational results over different classes of images of the DOTmark benchmark. Each class of gray scale images contains 10 instances, and we compute the distance between every possible pair of images within the same class: the first image plays the role of the source distribution \u00b5, and the second image gives the target distribution \u03bd. Considering all pairs within a class, this gives 45 instances for each class. We report the means and the standard deviations (between brackets) of the runtime, measured in seconds. Table 1(a) shows in the second column the runtime for EMD [24]. The third and fourth columns give the runtime and the optimality gap for Sinkhorn\u2019s algorithm with \u03bb = 1; the sixth and seventh columns for \u03bb = 1.5. 
The percentage gap is computed as $Gap = \\frac{UB - opt}{opt} \\cdot 100$, where UB is the upper bound computed by Sinkhorn\u2019s algorithm, and opt is the optimal value computed by EMD. The last two columns report the runtime for the bipartite and 3-partite approaches presented in this paper.\n\nTable 1(b) compares our 3-partite formulation with the Improved Sinkhorn\u2019s algorithm [28, 27], reporting the same statistics as the previous table. In this case, we ran the Improved Sinkhorn using three values of the parameter \u03bb, namely 1.0, 1.25, and 1.5. While the Improved Sinkhorn is indeed much faster than the general algorithm as presented in [11], it suffers from the same numerical stability issues, and it can yield very poor percentage gaps with respect to the optimal solution, as happens for the GRFrough and the WhiteNoise classes, where the optimality gaps are on average 31.0% and 39.2%, respectively.\n\nAs shown in Tables 1(a) and 1(b), the 3-partite approach is clearly faster than any of the alternatives considered here, despite being an exact method. In addition, we remark that, even on the bipartite formulation, the Network Simplex implementation of the Lemon Graph library is orders of magnitude faster than EMD, and hence it should be the preferred choice on this particular type of instance. We remark that it might be unfair to compare an algorithm implemented in C++ with an algorithm implemented in Matlab, but still, the true comparison is on the solution quality more than on the runtime. Moreover, when implemented on modern GPUs that can fully exploit parallel matrix-vector operations, Sinkhorn\u2019s algorithm can run much faster, but this cannot improve the optimality gap.\n\nIn order to evaluate how our approach scales with the size of the images, we ran additional tests using images of size 64 \u00d7 64 and 128 \u00d7 128. 
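Returning briefly to the Improved Sinkhorn approach (item 3 above): the Kronecker trick rests on the identity $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(A X B^T)$ for row-major vectorization, which replaces one large matrix-vector product by two small matrix-matrix products. A minimal plain-Python check of this identity, with small illustrative matrices of our own choosing (not the Matlab code used in the experiments):

```python
def matmul(A, B):
    """Dense matrix product on lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Kronecker product A ⊗ B."""
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def vec(X):
    """Row-major vectorization (stack the rows)."""
    return [x for row in X for x in row]

A = [[1, 2], [3, 4]]
B = [[0, 1], [5, 2]]
X = [[1, 0], [2, 3]]

# Naive product: (A ⊗ B) vec(X), using the full 4x4 matrix.
K = kron(A, B)
x = vec(X)
slow = [sum(K[i][j] * x[j] for j in range(len(x))) for i in range(len(K))]
# Structured product: vec(A X B^T), two small matrix-matrix products.
fast = vec(matmul(matmul(A, X), transpose(B)))
print(slow == fast)  # True
```

For an $n \times n$ kernel with Kronecker structure, the naive product costs $O(n^2)$ operations while the structured one works with $\sqrt{n} \times \sqrt{n}$ factors, which is the runtime improvement described above.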
Table 2 reports the results for the bipartite and 3-partite approaches for increasing sizes of the 2-dimensional histograms. The table reports, for each of the two approaches, the number of vertices |V| and of arcs |A|, and the means and standard deviations of the runtime. As before, each row gives the averages over 45 instances. Table 2 shows that the 3-partite approach is clearly better (i) in terms of memory, since the 3-partite graph has a fraction of the number of arcs, and (ii) in terms of runtime, since it is at least an order of magnitude faster in computation\n\nImage Class | EMD [24]: Runtime | Sinkhorn [11], \u03bb = 1: Runtime, Gap | \u03bb = 1.5: Runtime, Gap | Bipartite: Runtime | 3-partite: Runtime\nClassic | 24.0 (3.3) | 6.0 (0.5), 17.3% | 8.9 (0.7), 9.1% | 0.54 (0.05) | 0.07 (0.01)\nMicroscopy | 35.0 (3.3) | 3.5 (1.0), 2.4% | 5.3 (1.4), 1.2% | 0.55 (0.03) | 0.08 (0.01)\nShapes | 25.2 (5.3) | 1.6 (1.1), 5.6% | 2.5 (1.6), 3.0% | 0.50 (0.07) | 0.05 (0.01)\n(a)\n\nImage Class | Improved Sinkhorn [28, 27], \u03bb = 1: Runtime, Gap | \u03bb = 1.25: Runtime, Gap | \u03bb = 1.5: Runtime, Gap | 3-partite: Runtime\nCauchyDensity | 0.22 (0.15), 2.8% | 0.33 (0.23), 2.0% | 0.41 (0.28), 1.5% | 0.07 (0.01)\nClassic | 0.20 (0.01), 17.3% | 0.31 (0.02), 12.4% | 0.39 (0.03), 9.1% | 0.07 (0.01)\nGRFmoderate | 0.19 (0.01), 12.6% | 0.29 (0.02), 9.0% | 0.37 (0.03), 6.6% | 0.07 (0.01)\nGRFrough | 0.19 (0.01), 58.7% | 0.29 (0.01), 42.1% | 0.38 (0.02), 31.0% | 0.05 (0.01)\nGRFsmooth | 0.20 (0.02), 4.3% | 0.30 (0.04), 3.1% | 0.38 (0.04), 2.2% | 0.08 (0.01)\nLogGRF | 0.22 (0.05), 1.3% | 0.32 (0.08), 0.9% | 0.40 (0.13), 0.7% | 0.08 (0.01)\nLogitGRF | 0.22 (0.02), 4.7% | 0.33 (0.03), 3.3% | 0.42 (0.04), 2.5% | 0.07 (0.02)\nMicroscopy | 0.18 (0.03), 2.4% | 0.27 (0.04), 1.7% | 0.34 (0.05), 1.2% | 0.08 (0.02)\nShapes | 0.11 (0.04), 5.6% | 0.16 (0.06), 4.0% | 0.20 (0.07), 3.0% | 0.05 (0.01)\nWhiteNoise | 0.18 (0.01), 76.3% | 0.28 (0.01), 53.8% | 0.37 (0.02), 39.2% | 0.04 (0.00)\n(b)\n\nTable 1: Comparison of different approaches on 
32 \u00d7 32 images. The runtime (in seconds) is given as \u201cMean (StdDev)\u201d. The gap to the optimal value opt is computed as $\\frac{UB - opt}{opt} \\cdot 100$, where UB is the upper bound computed by Sinkhorn\u2019s algorithm. Each row reports the averages over 45 instances.\n\nSize | Image Class | Bipartite: |V|, |A|, Runtime | 3-partite: |V|, |A|, Runtime\n64 \u00d7 64 | Classic | 8 192, 16 777 216, 16.3 (3.6) | 12 288, 524 288, 2.2 (0.2)\n64 \u00d7 64 | Microscopy | 8 192, 16 777 216, 11.7 (1.4) | 12 288, 524 288, 1.0 (0.2)\n64 \u00d7 64 | Shape | 8 192, 16 777 216, 13.0 (3.9) | 12 288, 524 288, 1.1 (0.3)\n128 \u00d7 128 | Classic | 32 768, 268 435 456, 1 368 (545) | 49 152, 4 194 304, 36.2 (5.4)\n128 \u00d7 128 | Microscopy | 32 768, 268 435 456, 959 (181) | 49 152, 4 194 304, 23.0 (4.8)\n128 \u00d7 128 | Shape | 32 768, 268 435 456, 983 (230) | 49 152, 4 194 304, 17.8 (5.2)\n\nTable 2: Comparison of the bipartite and the 3-partite approaches on 2-dimensional histograms.\n\ntime. Indeed, the 3-partite formulation is better essentially because it exploits the structure of the ground distance c(x, y) used, that is, the squared $\\ell_2$ norm.\n\nFlow Cytometry biomedical data. Flow cytometry is a laser-based biophysical technology used to study human health disorders. Flow cytometry experiments produce huge sets of data, which are very hard to analyze with standard statistical methods and algorithms [9]. Currently, such data is used to study the correlations of only two factors (e.g., biomarkers) at a time, by visualizing 2-dimensional histograms and by measuring the (dis-)similarity between pairs of histograms [21]. However, during a flow cytometry experiment up to hundreds of factors (biomarkers) are measured and stored in digital format. Hence, we can use such data to build d-dimensional histograms that consider up to d biomarkers at a time, and then compare the similarity among different individuals by measuring the distance between the corresponding histograms. 
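To illustrate how such d-dimensional histograms can be assembled from raw measurements, the following sketch bins a list of d-dimensional points into a regular grid with N bins per axis (a hypothetical preprocessing step with our own naming and value range, not the actual pipeline used for the cytometry data):

```python
def histogram(points, d, N, lo=0.0, hi=1.0):
    """Bin d-dimensional measurements into a regular grid with N bins per
    axis; returns a dict mapping a bin's multi-index to its count."""
    bins = {}
    width = (hi - lo) / N
    for p in points:
        assert len(p) == d
        # Clamp each index into [0, N-1] so the upper edge lands in the last bin.
        idx = tuple(min(int((x - lo) / width), N - 1) for x in p)
        bins[idx] = bins.get(idx, 0) + 1
    return bins

# Two biomarkers (d = 2), N = 4 bins per axis on the unit square:
data = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.95), (1.0, 0.0)]
h = histogram(data, d=2, N=4)
print(h)  # {(0, 0): 2, (3, 3): 1, (3, 0): 1}
```

Two histograms built this way (and normalized to the same total mass) are exactly the source \u00b5 and target \u03bd compared in Table 3.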
In this work, we used the flow cytometry data related to Acute Myeloid Leukemia (AML), available at http://flowrepository.org/id/FR-FCM-ZZYA, which contains cytometry data for 359 patients, classified as \u201cnormal\u201d or affected by AML. This dataset has been used by the bioinformatics community to run clustering algorithms, which should predict whether a new patient is affected by AML [1].\n\nN | d | n | Bipartite Graph: |V|, |A|, Runtime | (d + 1)-partite Graph: |V|, |A|, Runtime\n16 | 2 | 256 | 512, 65 536, 0.024 (0.01) | 768, 8 192, 0.003 (0.00)\n16 | 3 | 4 096 | 8 192, 16 777 216, 38.2 (14.0) | 16 384, 196 608, 0.12 (0.02)\n16 | 4 | 65 536 | out-of-memory | 327 680, 4 194 304, 4.8 (0.84)\n32 | 2 | 1 024 | 2 048, 1 048 576, 0.71 (0.14) | 3 072, 65 536, 0.04 (0.01)\n32 | 3 | 32 768 | out-of-memory | 131 072, 3 145 728, 5.23 (0.69)\n\nTable 3: Comparison between the bipartite and the (d + 1)-partite approaches on Flow Cytometry data.\n\nTable 3 reports the results of computing the distance between pairs of d-dimensional histograms, with d ranging in the set {2, 3, 4}, obtained using the AML biomedical data. Again, the first d-dimensional histogram plays the role of the source distribution \u00b5, while the second histogram gives the target distribution \u03bd. For simplicity, we considered regular histograms of size $n = N^d$ (i.e., n is the total number of bins), using N = 16 and N = 32. Table 3 compares the results obtained by the bipartite and (d + 1)-partite approaches, in terms of graph size and runtime. Again, the (d + 1)-partite approach, by exploiting the structure of the ground distance, outperforms the standard formulation of the optimal transport problem. 
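The graph sizes in Table 3 follow directly from the two constructions: the bipartite graph has 2n nodes and $n^2$ arcs, while the (d + 1)-partite graph has (d + 1)n nodes and $dN^{d+1} = dn^{\frac{d+1}{d}}$ arcs. A small sanity check (plain Python, our own helper name):

```python
def graph_sizes(N, d):
    """Node/arc counts of the two flow formulations on a regular N^d histogram."""
    n = N ** d                       # total number of bins
    bipartite = (2 * n, n ** 2)      # (|V|, |A|)
    multipartite = ((d + 1) * n, d * N ** (d + 1))
    return bipartite, multipartite

# N = 16, d = 3 (second row of Table 3):
print(graph_sizes(16, 3))  # ((8192, 16777216), (16384, 196608))
```

The ratio of arc counts, $n^2 / (dn^{\frac{d+1}{d}})$, grows with both N and d, which explains why the bipartite formulation runs out of memory first.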
We remark that for N = 32 and d = 3, we pass from going out-of-memory with the bipartite formulation to computing the distance in around 5 seconds with the 4-partite formulation.

5 Conclusions

In this paper, we have presented a new network flow formulation on (d + 1)-partite graphs that can speed up the optimal solution of transportation problems whenever the ground cost function c(x, y) (see objective function (3)) has a separable structure along the main d directions, such as, for instance, the squared ℓ2 norm used in the computation of the Kantorovich-Wasserstein distance of order 2.
Our computational results on two different datasets show how our approach scales with the size of the histograms N and with the dimension of the histograms d. Indeed, by exploiting the cost structure, the proposed approach is better in terms of memory consumption, since it has only dn^{(d+1)/d} arcs instead of n². In addition, it is much faster, since it has to solve an uncapacitated minimum cost flow problem on a much smaller flow network.

Acknowledgments

We are deeply indebted to Giuseppe Savaré, for introducing us to optimal transport and for many stimulating discussions and suggestions. We thank Mattia Tani for a useful discussion concerning the Improved Sinkhorn's algorithm.
This research was partially supported by the Italian Ministry of Education, University and Research (MIUR): Dipartimenti di Eccellenza Program (2018–2022) - Dept. of Mathematics "F. Casorati", University of Pavia.
The last author's research is partially supported by "PRIN 2015. 2015SNS29B-002. Modern Bayesian nonparametric methods".

References

[1] Nima Aghaeepour, Greg Finak, Holger Hoos, Tim R Mosmann, Ryan Brinkman, Raphael Gottardo, Richard H Scheuermann, FlowCAP Consortium, DREAM Consortium, et al. Critical assessment of automated flow cytometry data analysis techniques.
Nature Methods, 10(3):228, 2013.

[2] Ravindra K Ahuja, Thomas L Magnanti, and James B Orlin. Network Flows: Theory, Algorithms, and Applications. Cambridge, Mass.: Alfred P. Sloan School of Management, Massachusetts Institute of Technology, 1988.

[3] Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1961–1971, 2017.

[4] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient Flows: in Metric Spaces and in the Space of Probability Measures. Springer Science & Business Media, 2008.

[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[6] Federico Bassetti, Antonella Bodini, and Eugenio Regazzini. On minimum Kantorovich distance estimators. Statistics & Probability Letters, 76(12):1298–1302, 2006.

[7] Federico Bassetti, Stefano Gualandi, and Marco Veneroni. On the computation of Kantorovich-Wasserstein distances between 2D-histograms by uncapacitated minimum cost flows. arXiv preprint arXiv:1804.00445, 2018.

[8] Federico Bassetti and Eugenio Regazzini. Asymptotic properties and robustness of minimum dissimilarity estimators of location-scale parameters. Theory of Probability & Its Applications, 50(2):171–186, 2006.

[9] Tytus Bernas, Elikplimi K Asem, J Paul Robinson, and Bartek Rajwa. Quadratic form: a robust metric for quantitative comparison of flow cytometric histograms. Cytometry Part A, 73(8):715–726, 2008.

[10] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced transport problems. arXiv preprint arXiv:1607.05816, 2016.

[11] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport.
In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[12] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International Conference on Machine Learning, pages 685–693, 2014.

[13] Merrill M Flood. On the Hitchcock distribution problem. Pacific Journal of Mathematics, 3(2):369–386, 1953.

[14] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.

[15] Andrew V Goldberg, Éva Tardos, and Robert Tarjan. Network flow algorithms. Technical report, Cornell University Operations Research and Industrial Engineering, 1989.

[16] Péter Kovács. Minimum-cost flow algorithms: an experimental evaluation. Optimization Methods and Software, 30(1):94–127, 2015.

[17] Elizaveta Levina and Peter Bickel. The Earth Mover's distance is the Mallows distance: Some insights from statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 251–256. IEEE, 2001.

[18] Matthias Liero, Alexander Mielke, and Giuseppe Savaré. Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Inventiones Mathematicae, 211(3):969–1117, 2018.

[19] Haibin Ling and Kazunori Okada. An efficient Earth Mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):840–853, 2007.

[20] James B Orlin. A faster strongly polynomial minimum cost flow algorithm. Operations Research, 41(2):338–350, 1993.

[21] Darya Y Orlova, Noah Zimmerman, Stephen Meehan, Connor Meehan, Jeffrey Waters, Eliver EB Ghosn, Alexander Filatenkov, Gleb A Kolyagin, Yael Gernez, Shanel Tsuda, et al.
Earth Mover's distance (EMD): a true metric for comparing biomarker expression levels in cell populations. PLoS ONE, 11(3):1–14, 2016.

[22] Ofir Pele and Michael Werman. Fast and robust Earth Mover's distances. In Computer Vision, 2009 IEEE 12th International Conference on, pages 460–467. IEEE, 2009.

[23] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Technical Report 1803.00567, arXiv, 2018.

[24] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In Computer Vision, 1998. Sixth International Conference on, pages 59–66. IEEE, 1998.

[25] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The Earth Mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[26] Jörn Schrieber, Dominic Schuhmacher, and Carsten Gottschlich. DOTmark – a benchmark for discrete optimal transport. IEEE Access, 5:271–282, 2017.

[27] Justin Solomon. Optimal transport on discrete domains. Technical Report 1801.07745, arXiv, 2018.

[28] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[29] Justin Solomon, Raif Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propagation for semi-supervised learning. In International Conference on Machine Learning, pages 306–314, 2014.

[30] Anatoly Moiseevich Vershik. Long history of the Monge-Kantorovich transportation problem. The Mathematical Intelligencer, 35(4):1–9, 2013.

[31] Cédric Villani. Optimal Transport: Old and New, volume 338.
Springer Science & Business Media, 2008.