{"title": "Interior-Point Methods Strike Back: Solving the Wasserstein Barycenter Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 6894, "page_last": 6905, "abstract": "Computing the Wasserstein barycenter of a set of probability measures under the optimal transport metric can quickly become prohibitive for traditional second-order algorithms, such as interior-point methods, as the support size of the measures increases. In this paper, we overcome the difficulty by developing a new adapted interior-point method that fully exploits the problem's special matrix structure to reduce the iteration complexity and speed up the Newton procedure. Different from regularization approaches, our method achieves a well-balanced tradeoff between accuracy and speed. A numerical comparison on various distributions with existing algorithms exhibits the computational advantages of our approach. Moreover, we demonstrate the practicality of our algorithm on image benchmark problems including MNIST and Fashion-MNIST.", "full_text": "Interior-point Methods Strike Back: Solving the\n\nWasserstein Barycenter Problem\n\nDongdong Ge\n\nResearch Institute for Interdisciplinary Sciences\nShanghai University of Finance and Economics\n\nge.dongdong@mail.shufe.edu.cn\n\nHaoyue Wang\u2217\n\nSchool of Mathematical Sciences\n\nFudan University\n\nhaoyuewang14@fudan.edu.cn\n\nZikai Xiong\u2217\n\nFudan University\n\nzkxiong16@fudan.edu.cn\n\nSchool of Mathematical Sciences\n\nDepartment of Management Science and Engineering\n\nYinyu Ye\n\nStanford University\nyyye@stanford.edu\n\nAbstract\n\nComputing the Wasserstein barycenter of a set of probability measures under the\noptimal transport metric can quickly become prohibitive for traditional second-\norder algorithms, such as interior-point methods, as the support size of the measures\nincreases. 
In this paper, we overcome the difficulty by developing a new adapted interior-point method that fully exploits the problem's special matrix structure to reduce the iteration complexity and speed up the Newton procedure. Different from regularization approaches, our method achieves a well-balanced tradeoff between accuracy and speed. A numerical comparison on various distributions with existing algorithms exhibits the computational advantages of our approach. Moreover, we demonstrate the practicality of our algorithm on image benchmark problems including MNIST and Fashion-MNIST.\n\n1 Introduction\n\nComparing, summarizing, and combining probability measures defined on a space is a fundamental task in statistics and machine learning. Given support points of probability measures in a metric space and a transportation cost function (e.g., the Euclidean distance), the Wasserstein distance defines a distance between two measures as the minimal transportation cost between them. This notion of distance leads to a host of important applications, including text classification [30], clustering [25, 26, 15, 31], unsupervised learning [23, 13], semi-supervised learning [47], supervised learning [27, 19], statistics [38, 39, 48, 21], and others [7, 41, 1, 44, 37]. Given a set of measures in the same space, the 2-Wasserstein barycenter is defined as the measure minimizing the sum of squared 2-Wasserstein distances to all measures in the set. For example, if a set of images (with common structure but varying noise) are modeled as probability measures, then the Wasserstein barycenter is a mixture of the images that shares this common structure. The Wasserstein barycenter better captures the underlying geometric structure than the barycenter defined by the Euclidean or other distances. 
As a result, the Wasserstein barycenter has applications in clustering [25, 26, 15], image retrieval [14] and others [32, 43, 11, 34].\n\nFrom the computational point of view, finding the barycenter of a set of discrete measures can be formulated as a linear program [6, 8]. Nonetheless, state-of-the-art linear programming solvers do not scale to the immense amount of data involved in barycenter calculations.\n\n*Haoyue Wang and Zikai Xiong are corresponding authors.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nCurrent research on computation mainly follows two types of methods. The first type attempts to solve the linear program (or an equivalent problem) with scalable first-order methods. J. Ye et al. [54] use a modified Bregman ADMM (BADMM), introduced by [51], to compute Wasserstein barycenters for clustering problems. L. Yang et al. [53] adopt symmetric Gauss-Seidel ADMM to solve the dual linear program, which reduces the computational cost of each iteration. S. Claici et al. [12] introduce a stochastic alternating algorithm that can handle continuous input measures. However, these methods remain computationally inefficient when the number of support points of the input measures and the number of input measures are large. Due to the nature of first-order methods, these algorithms often converge too slowly to reach high-accuracy solutions.\n\nThe second, more mainstream, approach introduces an entropy regularization term into the linear programming formulation [14, 9]. This technique was first developed for the optimal transport problem; see [14, 3, 18, 50, 33, 10, 22] for related work. M. Staib et al. [49] discuss parallel computation and introduce a sampling method. P. Dvurechenskii et al. [17] study decentralized and distributed computation for the regularized problem. 
These methods are indeed suitable for large-scale problems due to their low computational cost and parsimonious memory usage. However, this advantage comes at the expense of solution accuracy: when the regularization term is weighted less in order to approximate the original problem more closely, computational efficiency degenerates and the outputs become unstable [9]. S. Amari et al. [5] propose an entropic-regularization-based sharpening technique, but it does not recover the exact barycenter. P. C. Alvarez-Esteban et al. [4] prove that the barycenter must be the fixed point of a certain operator. See [42] for a detailed survey of related algorithms.\n\nIn this paper, we develop a new interior-point method (IPM), the Matrix-based Adaptive Alternating Interior-Point Method (MAAIPM), to efficiently calculate Wasserstein barycenters. If the support is pre-specified, we apply the Mizuno-Todd-Ye predictor-corrector IPM [36]. The algorithm attains the quadratic convergence rate shown by Y. Ye et al. [56], which is a distinct advantage of IPMs over first-order methods. In practice, we implement Mehrotra's predictor-corrector IPM [35] and add careful heuristics for choosing step lengths and centering parameters. If the support is also to be optimized, MAAIPM alternately updates the support and the linear program variables in an adaptive strategy. At the beginning, MAAIPM updates the support points X* by an unconstrained quadratic program after every few IPM iterations. Near the end, MAAIPM updates X* after every IPM iteration and applies \"jump\" tricks to escape local minima. Under the framework of MAAIPM, we present two block-matrix-based accelerated algorithms to quickly solve the Newton equations at each iteration. 
Despite a prevailing belief that IPMs are inefficient for large-scale problems, we show that this inefficiency can be overcome through careful manipulation of the block structure of the normal equation. As a result, our stylized IPM has the following advantages.\n\nLow theoretical complexity. The linear programming formulation of the Wasserstein barycenter has $m\sum_{i=1}^N m_i + m$ variables and $Nm + \sum_{i=1}^N m_i + 1$ constraints, where the integers $N$, $m$ and $m_i$ will be specified later. Although MAAIPM is still a second-order method, in our two block-matrix-based accelerated algorithms, every iteration of solving the Newton direction has a time complexity of merely $O(m^2\sum_{i=1}^N m_i + Nm^3)$ or $O(m\sum_{i=1}^N m_i^2 + \sum_{i=1}^N m_i^3)$, where a standard IPM would need $O\big((Nm + \sum_{i=1}^N m_i + 1)^2 (m\sum_{i=1}^N m_i + m)\big)$. For simplicity, let $m_i = m$ for $i = 1, 2, \ldots, N$; then the time complexity of our algorithm in each iteration is $O(Nm^3)$, instead of the standard IPM's complexity $O(N^3 m^4)$. Note that, theoretically, when $Nm^2 = (Nm)^k$ for some $1 < k < 2$, the complexity of the standard IPM can be reduced to $O((Nm)^{\omega(k)}) + O((Nm)^3)$ via fast matrix computation methods, where the specific value of $\omega(k)$ can be found in Table 3 of [20].\n\nPractical effectiveness in speed and accuracy. Compared to regularized methods, IPMs by nature attain high-accuracy solutions and a high convergence rate. Numerical experiments show that our algorithm converges to highly accurate solutions of the original linear program within the smallest number of iterations. 
Figure 1 shows the advantage of our method in accuracy in comparison to the well-developed Sinkhorn-type algorithm [14, 9].\n\nFigure 1: A comparison of algorithms for computing the barycenters between a Sinkhorn-based approach [9] (left) and MAAIPM (right). Samples of handbags (first 4 rows) are from the Fashion-MNIST dataset.\n\nThere are more advantages of our approach in real implementations. When the support points of the measures are the same, there are several specially designed, highly memory-efficient and thus very fast Sinkhorn-based algorithms, such as [46, 9]. However, when the support points of the measures differ, the convolutional method in [46] is no longer applicable, and the memory usage of our method is within a constant multiple of that of the popular memory-efficient first-order Sinkhorn method IBP [9], much less than the memory used by a commercial solver. In this case, experiments also show that our algorithm can perform best in both accuracy and overall runtime. Our algorithm also inherits a natural structure that potentially fits parallel computing schemes well. These merits make our algorithm highly suitable for large-scale computation of Wasserstein barycenters.\n\nThe rest of the paper is organized as follows. In Section 2, we briefly define the Wasserstein barycenter. In Section 3, we present its linear programming formulation and introduce the IPM framework. In Section 4, we present an IPM implementation that greatly reduces the computational cost of classical IPMs. In Section 5, we present our numerical results.\n\n2 Background and Preliminaries\n\nIn this section, we briefly recall the Wasserstein distance and the Wasserstein barycenter for a set of discrete probability measures [2, 16]. Let $\Sigma_n = \{a \in \mathbb{R}^n \mid \sum_{i=1}^n a_i = 1,\ a_i \ge 0 \text{ for } i = 1, 2, \ldots, n\}$ be the probability simplex in $\mathbb{R}^n$. 
For two vectors $s^{(1)} \in \Sigma_{n_1}$, $s^{(2)} \in \Sigma_{n_2}$, define the set of matrices $M(s^{(1)}, s^{(2)}) = \{\Pi \in \mathbb{R}_+^{n_1 \times n_2} : \Pi 1_{n_2} = s^{(1)},\ \Pi^\top 1_{n_1} = s^{(2)}\}$. Let $P = \{(a_i, q_i) : i = 1, \ldots, m\}$ denote the discrete probability measure supported on $m$ points $q_1, \ldots, q_m$ in $\mathbb{R}^d$ with weights $a_1, \ldots, a_m$, respectively. The 2-Wasserstein distance between the two measures $U = \{(a_i, q_i) : i = 1, \ldots, m_1\}$ and $V = \{(b_j, p_j) : j = 1, \ldots, m_2\}$ is\n\n$W_2(U, V) := \min \Big\{ \big( \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \pi_{ij} \|q_i - p_j\|^2 \big)^{1/2} : \Pi = [\pi_{ij}] \in M(a, b) \Big\}$   (1)\n\nwhere $a = (a_1, \ldots, a_{m_1})^\top$ and $b = (b_1, \ldots, b_{m_2})^\top$. Consider a set of probability measures $\{P^{(t)}, t = 1, \cdots, N\}$ where $P^{(t)} = \{(a_i^{(t)}, q_i^{(t)}) : i = 1, \ldots, m_t\}$, and let $a^{(t)} = (a_1^{(t)}, \ldots, a_{m_t}^{(t)})^\top$. The Wasserstein barycenter (with $m$ support points) $P = \{(w_i, x_i) : i = 1, \cdots, m\}$ is another probability measure which is defined as a solution of the problem\n\n$\min_P \sum_{t=1}^N (W_2(P, P^{(t)}))^2$.   (2)\n\nFurthermore, define the simplex $S = \big\{(w, \Pi^{(1)}, \ldots, \Pi^{(N)}) \in \mathbb{R}_+^m \times \mathbb{R}_+^{m \times m_1} \times \cdots \times \mathbb{R}_+^{m \times m_N} : 1_m^\top w = 1,\ w \ge 0;\ \Pi^{(t)} 1_{m_t} = w,\ (\Pi^{(t)})^\top 1_m = a^{(t)},\ \Pi^{(t)} \ge 0,\ \forall t = 1, \cdots, N\big\}$. For a given set of support points $X = \{x_1, \ldots, x_m\}$, define the distance matrices $D^{(t)}(X) = [\|x_i - q_j^{(t)}\|_2^2] \in \mathbb{R}^{m \times m_t}$ for $t = 1, \ldots, N$. Then problem (2) is equivalent to\n\n$\min_{w, X, \Pi^{(t)}} \frac{1}{N} \sum_{t=1}^N \langle D^{(t)}(X), \Pi^{(t)} \rangle$ s.t. $(w, \Pi^{(1)}, \ldots, \Pi^{(N)}) \in S$, $x_1, \ldots, x_m \in \mathbb{R}^d$.   (3)\n\nProblem (3) is a nonconvex problem, in which one needs to find the optimal support points $X$ and the optimal weight vector $w$ of a barycenter simultaneously. 
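To make the definitions concrete: the minimization inside (1) is a small linear program over the transport polytope $M(a,b)$. Below is a minimal sketch, assuming numpy and scipy are available; the helper name `w2` and the toy data are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog  # any LP solver would do


def w2(a, q, b, p):
    """2-Wasserstein distance between discrete measures {(a_i, q_i)} and {(b_j, p_j)}.

    Solves min <D, Pi> over the transport polytope M(a, b), with
    D_ij = ||q_i - p_j||^2, and returns the square root of the optimum.
    """
    m1, m2 = len(a), len(b)
    D = ((q[:, None, :] - p[None, :, :]) ** 2).sum(axis=2)
    # With Pi flattened row-major, the marginal constraints Pi 1 = a and
    # Pi^T 1 = b become Kronecker-structured equality constraints.
    A_eq = np.vstack([
        np.kron(np.eye(m1), np.ones((1, m2))),   # row sums = a
        np.kron(np.ones((1, m1)), np.eye(m2)),   # column sums = b
    ])
    b_eq = np.concatenate([a, b])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, method="highs")
    return float(np.sqrt(res.fun))
```

For two unit point masses at locations 0 and 3 this gives $W_2 = 3$, and identical measures give 0; the square root matches the definition of $W_2$ rather than its square.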
However, in many real applications, the support $X$ of a barycenter can be specified empirically from the support points of $\{P^{(t)}\}_{t=1}^N$. Indeed, in some cases, all measures in $\{P^{(t)}\}_{t=1}^N$ have the same set of support points, and hence the barycenter should also take the same set of support points. In view of this, we will also focus on the case when the support $X$ is given. Consequently, problem (3) reduces to the following problem:\n\n$\min_{w, \Pi^{(t)}} \frac{1}{N} \sum_{t=1}^N \langle D^{(t)}, \Pi^{(t)} \rangle$ s.t. $(w, \Pi^{(1)}, \ldots, \Pi^{(N)}) \in S$   (4)\n\nwhere $D^{(t)}$ denotes $D(X, Q^{(t)})$ for simplicity. In the following sections, we refer to problem (4) as the Pre-specified Support Problem, and call problem (3) the Free Support Problem.\n\n3 General Framework for MAAIPM\n\nLinear programming formulation and preconditioning. Note that the Pre-specified Support Problem is a linear program. In this subsection, we focus on removing redundant constraints. First, we vectorize the constraints $\Pi^{(t)} 1_{m_t} = w$ and $(\Pi^{(t)})^\top 1_m = a^{(t)}$ captured in $S$ to become\n\n$(1_{m_t}^\top \otimes I_m)\,\mathrm{vec}(\Pi^{(t)}) = w, \quad (I_{m_t} \otimes 1_m^\top)\,\mathrm{vec}(\Pi^{(t)}) = a^{(t)}, \quad t = 1, \cdots, N.$\n\nThus, problem (4) can be formulated as the standard-form linear program\n\n$\min\ c^\top x$ s.t. $Ax = b,\ x \ge 0$   (5)\n\nwith $x = (\mathrm{vec}(\Pi^{(1)}); \ldots; \mathrm{vec}(\Pi^{(N)}); w)$, $b = (a^{(1)}; a^{(2)}; \ldots; a^{(N)}; 0_m; \ldots; 0_m; 1)$, $c = (\mathrm{vec}(D^{(1)}); \ldots; \mathrm{vec}(D^{(N)}); 0)$ and\n\n$A = \begin{bmatrix} E_1 & 0 \\ E_2 & E_3 \\ 0 & 1_m^\top \end{bmatrix}$,\n\nwhere $E_1$ is a block diagonal matrix, $E_1 = \mathrm{diag}(I_{m_1} \otimes 1_m^\top, \ldots, I_{m_N} \otimes 1_m^\top)$; $E_2$ is a block diagonal matrix, $E_2 = \mathrm{diag}(1_{m_1}^\top \otimes I_m, \ldots, 1_{m_N}^\top \otimes I_m)$; and $E_3 = -1_N \otimes I_m$. 
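As a sanity check on this construction, the sketch below (numpy assumed; the sizes $N = 2$, $m = 3$ and the weights are hypothetical) assembles $A$ from its Kronecker blocks, verifies $Ax = b$ at a feasible point of the product form $\Pi^{(t)} = w (a^{(t)})^\top$, and confirms that exactly $N$ rows are redundant, consistent with Lemma 3.1 below.

```python
import numpy as np


def build_A(m, mts):
    """Assemble the constraint matrix A of the barycenter LP (5).

    Row blocks: (I_{m_t} x 1_m^T) vec(Pi) = a^(t)    (E_1 part),
                (1_{m_t}^T x I_m) vec(Pi) - w = 0    (E_2, E_3 part),
                1_m^T w = 1.
    vec() is column-stacking, i.e. flatten(order='F') in numpy.
    """
    N, S = len(mts), sum(mts)
    A = np.zeros((N * m + S + 1, m * S + m))
    col = 0
    for t, mt in enumerate(mts):
        r1 = sum(mts[:t])                 # rows enforcing the a^(t) marginal
        r2 = S + t * m                    # rows coupling Pi^(t) to w
        A[r1:r1 + mt, col:col + m * mt] = np.kron(np.eye(mt), np.ones((1, m)))
        A[r2:r2 + m, col:col + m * mt] = np.kron(np.ones((1, mt)), np.eye(m))
        col += m * mt
    A[S:S + N * m, -m:] = np.kron(-np.ones((N, 1)), np.eye(m))   # E_3 = -1_N x I_m
    A[-1, -m:] = 1.0                                             # 1_m^T w = 1
    return A


m, mts = 3, [2, 4]
N = len(mts)
A = build_A(m, mts)

# A feasible point: uniform w and product couplings Pi^(t) = w (a^(t))^T.
w = np.ones(m) / m
ats = [np.array([0.3, 0.7]), np.array([0.1, 0.2, 0.3, 0.4])]
x = np.concatenate([np.outer(w, a).flatten(order="F") for a in ats] + [w])
b = np.concatenate(ats + [np.zeros(N * m), np.array([1.0])])
```

Here `np.linalg.matrix_rank(A)` comes out as `A.shape[0] - N`, which is the rank deficiency that the row-removal of Lemma 3.1 eliminates.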
Let $M := \sum_{i=1}^N m_i$, $n_{row} := Nm + \sum_{i=1}^N m_i + 1$ and $n_{col} := m\sum_{i=1}^N m_i + m$. Then $A \in \mathbb{R}^{n_{row} \times n_{col}}$, $b \in \mathbb{R}^{n_{row}}$ and $c \in \mathbb{R}^{n_{col}}$. We are faced with a standard-form linear program with $n_{col}$ variables and $n_{row}$ constraints. In the special case where all $m_i = m$, the number of variables is $O(Nm^2)$ and the number of constraints is $O(Nm)$.\n\nFor efficient implementations of IPMs for this linear program, we need to remove redundant constraints.\n\nLemma 3.1 Let $\bar{A} \in \mathbb{R}^{(n_{row}-N) \times n_{col}}$ be obtained from $A$ by removing the $(M+1)$-th, $(M+m+1)$-th, $\cdots$, $(M+(N-1)m+1)$-th rows of $A$, and $\bar{b} \in \mathbb{R}^{n_{row}-N}$ be obtained from $b$ by removing the $(M+1)$-th, $(M+m+1)$-th, $\cdots$, $(M+(N-1)m+1)$-th entries of $b$. Then 1) $\bar{A}$ has full row rank; 2) $x$ satisfies $Ax = b$ if and only if $x$ satisfies $\bar{A}x = \bar{b}$.\n\nThe proof of this lemma is available in the supplement. With this lemma, the primal and dual problems of (5) can be written as\n\n(Primal) $\min\ c^\top x$ s.t. $\bar{A}x = \bar{b},\ x \ge 0$; (Dual) $\max\ \bar{b}^\top \lambda$ s.t. $\bar{A}^\top \lambda + s = c,\ s \ge 0$.   (6)\n\nFramework of the Matrix-based Adaptive Alternating Interior-Point Method (MAAIPM). When the support points are not pre-specified, we need to solve problem (3). As we just saw, when $X$ is fixed, the problem becomes a linear program. When $(w, \{\Pi^{(t)}\})$ are fixed, the problem is a quadratic optimization problem with respect to $X$, and the optimal $X^*$ can be written in closed form as\n\n$x_i^* = \big( \sum_{t=1}^N \sum_{j=1}^{m_t} \pi_{ij}^{(t)} \big)^{-1} \sum_{t=1}^N \sum_{j=1}^{m_t} \pi_{ij}^{(t)} q_j^{(t)}, \quad i = 1, 2, \ldots, m.$   (7)\n\nIn other words, (3) can be reformulated as\n\n$\min\ c(x)^\top x$ s.t. 
$\bar{A}x = \bar{b},\ x \ge 0$.   (8)\n\nSince (3) is a non-convex problem, it contains saddle points and local minima, which makes finding a global optimizer difficult. Examples of local minima and saddle points are available in the supplement. The alternating minimization strategy used in [16, 53, 54] alternates between optimizing $X$ via (7) and optimizing $(w, \{\Pi^{(t)}\})$ by solving (4). However, this alternating approach cannot avoid local minima or saddle points, and every iteration may require solving a linear program (4), which is expensive when the problem size is large.\n\nTo overcome these drawbacks, we propose the Matrix-based Adaptive Alternating IPM (MAAIPM). If the support is pre-specified, we solve a single linear program by a predictor-corrector IPM [35, 40, 52]. If the support must also be optimized, MAAIPM uses an adaptive strategy. At the beginning, because the primal variables are far from the optimal solution, MAAIPM updates $X^*$ via (7) only after every few IPM iterations for $(w, \{\Pi^{(t)}\})$. Later, MAAIPM updates $X^*$ after every IPM iteration and applies \"jump\" tricks to escape local minima. Although MAAIPM cannot guarantee finding a globally optimal solution, it frequently reaches a better solution in a shorter time. Since at the beginning MAAIPM updates $X^*$ only after many IPM iterations, the primal-dual predictor-corrector IPM is more efficient there. Near the end, $X^*$ is updated more often, and each update of $X^*$ changes the linear programming objective function, so the dual variables may become infeasible. However, the primal variables always remain feasible, so the primal IPM is more suitable at the end. 
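The closed-form update (7) used in this alternation is simply a transport-weighted average of all input support points. A minimal numpy sketch (the function name and shapes are ours: `Pis` holds the $m \times m_t$ couplings, `Qs` the $m_t \times d$ support matrices):

```python
import numpy as np


def update_support(Pis, Qs):
    """Closed-form support update (7):
    x_i = (sum_t sum_j pi_ij^(t) q_j^(t)) / (sum_t sum_j pi_ij^(t))."""
    num = sum(Pi @ Q for Pi, Q in zip(Pis, Qs))   # m x d weighted sums of support points
    den = sum(Pi.sum(axis=1) for Pi in Pis)       # total transported mass per x_i
    return num / den[:, None]
```

For instance, with a single coupling [[0.5, 0.5]] over supports 0 and 2, the updated point is their average, 1.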
Moreover, the primal IPM is better suited to \"jump\" tricks and other local-minima-escaping techniques, as shown in [55]. Details and illustrations are available in the supplement.\n\nIn the predictor-corrector IPM, the main computational cost lies in solving the Newton equations, which can be reformulated as the normal equations\n\n$\bar{A}(D^k)^2 \bar{A}^\top \Delta\lambda^k = f^k,$   (9)\n\nwhere $(D^k)^2$ denotes $\mathrm{diag}(x_i^{(k)}/s_i^{(k)})$ and $f^k \in \mathbb{R}^{n_{row}-N}$. This linear system with matrix $\bar{A}(D^k)^2\bar{A}^\top$ can be efficiently solved by the two methods proposed in the next section. In the primal IPM, MAAIPM combines following the central path with optimizing the support points; i.e., each iteration contains three parts: taking a Newton step for the logarithmic barrier problem\n\n$\min\ c^\top x - \mu \sum_{i=1}^n \ln x_i$ s.t. $\bar{A}x = \bar{b},$   (10)\n\nreducing the penalty $\mu$, and updating the support via (7). The Newton direction $p^k$ at the $k$-th iteration is calculated as\n\n$p^k = x^k + (X^k)^2 \Big( \bar{A}^\top \big( \bar{A}(X^k)^2 \bar{A}^\top \big)^{-1} \big( \bar{A}(X^k)^2 c - \mu \bar{A}X^k 1 \big) - c \Big) / \mu^k,$   (11)\n\nwhere $X^k = \mathrm{diag}(x^{(k)})$. The main cost of the primal IPM lies in solving a linear system with matrix $\bar{A}(X^k)^2\bar{A}^\top$, which again can be efficiently solved by the two methods described in the following section. Furthermore, we apply a warm-start technique to choose the starting point of the next IPM after a \"jump\" [45]. Compared with primal-dual IPMs' warm-start strategies [29, 28], our technique saves searching time and only requires slightly more memory. 
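A useful property of direction (11) is that $\bar{A}p^k = 0$ by construction (substituting the multiplier estimate gives $\bar{A}p^k = \bar{A}x^k - \bar{A}X^k 1 = 0$), so steps along it preserve primal feasibility. The numpy sketch below checks this on random, entirely hypothetical data standing in for $\bar{A}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_row, n_col = 4, 9                  # illustrative sizes; A plays the role of A-bar
A = rng.standard_normal((n_row, n_col))
x = rng.random(n_col) + 0.1          # strictly positive interior iterate
c = rng.standard_normal(n_col)
mu = 0.5

# Newton direction (11): p = x + X^2 (A^T lam - c) / mu, with
# lam = (A X^2 A^T)^{-1} (A X^2 c - mu A X 1)  and  X = diag(x).
X2 = x ** 2                                        # diagonal of X^2
lam = np.linalg.solve(A * X2 @ A.T,                # A X^2 A^T (broadcast over columns)
                      A @ (X2 * c) - mu * (A @ x))
p = x + X2 * (A.T @ lam - c) / mu
```

Since $\bar{A}p^k = 0$, the iterate $x^k + \alpha p^k$ keeps $\bar{A}x = \bar{b}$ for any step size $\alpha$; only positivity of $x$ limits the step.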
When we suitably set the termination criterion, numerical studies show that MAAIPM outperforms previous algorithms in both speed and accuracy, whether or not the support is pre-specified.\n\n4 Efficient Methods for Solving the Normal Equations\n\nIn this section, we discuss efficient methods for solving normal equations of the form $(\bar{A}D\bar{A}^\top)z = f$, where $D$ is a diagonal matrix with positive diagonal entries. Let $d = \mathrm{diag}(D)$ and $M_2 = N(m-1)$. First, through a simple calculation, we have the following lemma on the structure of the matrix $\bar{A}D\bar{A}^\top$, whose proof is available in the supplement.\n\nLemma 4.1 $\bar{A}D\bar{A}^\top$ can be written in the following form:\n\n$\bar{A}D\bar{A}^\top = \begin{bmatrix} B_1 & B_2 & 0 \\ B_2^\top & B_3 + B_4 & \alpha \\ 0 & \alpha^\top & c \end{bmatrix}$\n\nwhere $B_1 \in \mathbb{R}^{M \times M}$ is a diagonal matrix with positive diagonal entries; $B_2 \in \mathbb{R}^{M \times M_2}$ is a block-diagonal matrix with $N$ blocks (the size of the $i$-th block is $m_i \times (m-1)$); $B_3 \in \mathbb{R}^{M_2 \times M_2}$ is a diagonal matrix with positive diagonal entries; letting $y = d(n_{col}-m+2 : n_{col})$, $B_4 = (1_N 1_N^\top) \otimes \mathrm{diag}(y)$ and $\alpha = -1_N \otimes y$; and $c = 1_m^\top d(n_{col}-m+1 : n_{col})$.\n\nSingle low-rank regularization method (SLRM). Briefly speaking, we perform several elementary transformations on the matrix $\bar{A}D\bar{A}^\top$ to bring it into an easy-to-solve form, solve the system with the transformed coefficient matrix, and finally transform the obtained solution back to get a solution of $(\bar{A}D\bar{A}^\top)z = f$. Define\n\n$V_1 := \begin{bmatrix} I_M & & \\ -B_2^\top B_1^{-1} & I_{M_2} & \\ & & 1 \end{bmatrix}, \quad V_2 := \begin{bmatrix} I_M & & \\ & I_{M_2} & -\alpha/c \\ & & 1 \end{bmatrix},$\n\nand $A_2 := B_4 - \frac{1}{c}\alpha\alpha^\top$. 
Then, setting $A_1 := B_3 - B_2^\top B_1^{-1} B_2$,\n\n$V_2 V_1\, \bar{A}D\bar{A}^\top\, V_1^\top V_2^\top = \begin{bmatrix} B_1 & & \\ & A_1 + A_2 & \\ & & c \end{bmatrix}.$\n\nDefine $Y = \mathrm{diag}(y) - \frac{1}{c}yy^\top$; we have the following lemma.\n\nLemma 4.2 a) $A_1$ is a block-diagonal matrix with $N$ blocks, each of size $(m-1) \times (m-1)$. Furthermore, $A_1$ is positive definite and strictly diagonally dominant. b) $A_2 = (1_N 1_N^\top) \otimes Y$, and $Y$ is positive definite and strictly diagonally dominant.\n\nBecause of the positive definiteness and diagonal dominance claimed in this lemma, computing the inverses of the blocks of $A_1$ and of $A_2$ is numerically stable. We now introduce the procedure for solving $(\bar{A}D\bar{A}^\top)z = f$, described in Algorithm 1 ($z^{(1)}$ to $z^{(4)}$ are intermediate variables).\n\nAlgorithm 1: Solver for the normal equation $(\bar{A}D\bar{A}^\top)z = f$\nInput: $d = \mathrm{diag}(D) \in \mathbb{R}^{n_{col}}$; $f \in \mathbb{R}^{M+N(m-1)+1}$\n1. compute $B_1$, $B_2$, $B_3$, the vector $y = d(n_{col}-m+2 : n_{col})$ and $c$;\n2. compute $T = B_2^\top B_1^{-1}$ and the matrices $V_1$, $V_2$;\n3. compute $A_1 = B_3 - TB_2$ and $A_2 = (1_N 1_N^\top) \otimes (\mathrm{diag}(y) - \frac{1}{c}yy^\top)$;\n4. compute $z^{(1)} = V_1 f$ and $z^{(2)} = V_2 z^{(1)}$;\n5. compute $z^{(3)}(1:M) = B_1^{-1} z^{(2)}(1:M)$;\n6. compute $z^{(3)}(M+M_2+1) = \frac{1}{c} z^{(2)}(M+M_2+1)$;\n7. solve the linear system with coefficient matrix $A_1 + A_2$ to get $z^{(3)}(M+1:M+M_2) = (A_1+A_2)^{-1} z^{(2)}(M+1:M+M_2)$;\n8. compute $z^{(4)} = V_2^\top z^{(3)}$ and $z = V_1^\top z^{(4)}$;\nOutput: $z$\n\nIn step 7, we need to solve a linear system whose coefficient matrix has dimension $N(m-1) \times N(m-1)$, which is expensive with standard methods for dense symmetric matrices. 
In view of the low-rank structure of the matrix $A_2$, we introduce a method, namely the Single Low-Rank regularization Method (SLRM), which requires only $O(Nm^3)$ flops. Assume $A_1 = \mathrm{diag}(A_{11}, A_{22}, \ldots, A_{NN})$ and define $U = \begin{bmatrix} I_{N-1} & 1_{N-1} \\ 0 & 1 \end{bmatrix} \otimes I_{m-1}$. We can solve the linear system $(A_1 + A_2)x = g$ by Algorithm 2. The proof of correctness of Algorithm 2 and further analysis are available in the supplement.\n\nAlgorithm 2: SLRM for the system $(A_1 + A_2)x = g$\nInput: $A_1$, $A_2$, $g$\n1. compute $A_{11}^{-1}, \ldots, A_{NN}^{-1}$;\n2. set $A_1^{-1} = \mathrm{diag}(A_{11}^{-1}, \ldots, A_{NN}^{-1})$;\n3. compute $x^{(1)} = A_1^{-1} g$;\n4. compute $x^{(2)} = U^\top x^{(1)}$;\n5. compute $x^{(3)}(\mathrm{end}-m+2 : \mathrm{end}) = (Y^{-1} + \sum_{i=1}^N A_{ii}^{-1}) \backslash x^{(2)}(\mathrm{end}-m+2 : \mathrm{end})$;\n6. set $x^{(3)}(1 : \mathrm{end}-m+1) = 0$;\n7. compute $x^{(4)} = U x^{(3)}$ and $x^{(5)} = A_1^{-1} x^{(4)}$;\n8. compute $x = x^{(1)} - x^{(5)}$;\nOutput: $x$\n\nDouble low-rank regularization method (DLRM) when $m$ is large. In many applications, $m$ is relatively large compared to $m_t$. For instance, in image identification, the pixel support points of the images at hand are sparse (small $m_t$) but different; to find the \"barycenter\" of these images, we need to assume the \"barycenter\" image has many more pixel support points (large $m$) than the sample images. Sometimes $m$ might be about 5 to 20 times each $m_t$. In this case, the computational cost of step 1 of SLRM is heavy, since we need to solve $N$ linear systems of dimension $m \times m$. In this subsection, we use the low-rank regularization formula to further reduce the computational cost. In view of Lemma 4.1, assume\n\n$B_1 = \mathrm{diag}(B_{11}, \ldots, B_{1N}), \quad B_2 = \mathrm{diag}(B_{21}, \ldots, B_{2N}), \quad B_3 = \mathrm{diag}(B_{31}, \ldots, B_{3N}),$\n\nwhere $B_{1i} \in \mathbb{R}^{m_i \times m_i}$, $B_{2i} \in \mathbb{R}^{m_i \times (m-1)}$ and $B_{3i} \in \mathbb{R}^{(m-1) \times (m-1)}$. Recalling that $A_1 = B_3 - B_2^\top B_1^{-1} B_2$ and $A_1 = \mathrm{diag}(A_{11}, \ldots, A_{NN})$, we have $A_{ii} = B_{3i} - B_{2i}^\top B_{1i}^{-1} B_{2i}$. 
Since $m \gg m_i$, we can use the following formula:\n\n$A_{ii}^{-1} = (B_{3i} - B_{2i}^\top B_{1i}^{-1} B_{2i})^{-1} = B_{3i}^{-1} + B_{3i}^{-1} B_{2i}^\top (B_{1i} - B_{2i} B_{3i}^{-1} B_{2i}^\top)^{-1} B_{2i} B_{3i}^{-1}.$   (12)\n\nInstead of calculating and storing each $A_{ii}$ explicitly, we can just calculate and store each $(B_{1i} - B_{2i} B_{3i}^{-1} B_{2i}^\top)^{-1}$. When we need to apply $A_{ii}^{-1}$ to some vector $y$, we can use (12) and multiply each factor with the vector sequentially. As a result, the flops required in step 1 of SLRM reduce to $O(m\sum_{i=1}^N m_i^2 + \sum_{i=1}^N m_i^3)$, and the total memory usage of the whole MAAIPM is $O(m\sum_{i=1}^N m_i)$, which is at the same level (up to a constant) as a primal variable.\n\nComplexity analysis. The following theorem summarizes the time and space complexity of the two methods above.\n\nTheorem 4.3 a) For SLRM, the time complexity in terms of flops is $O(m^2\sum_{i=1}^N m_i + Nm^3)$, and the memory usage in terms of doubles is $O(m\sum_{i=1}^N m_i + Nm^2)$. b) For DLRM, the time complexity in terms of flops is $O(m\sum_{i=1}^N m_i^2 + \sum_{i=1}^N m_i^3)$, and the memory usage in terms of doubles is $O(m\sum_{i=1}^N m_i + \sum_{i=1}^N m_i^2)$.\n\nWe can choose between SLRM and DLRM in different cases to achieve the lower time and space complexity. Note that as $N$, $m$ and $m_i$ grow, the memory usage here is within a constant multiple of that of representative Sinkhorn-type algorithms like IBP [9].\n\n5 Experiments\n\nWe conduct three numerical experiments to investigate the practical performance of our methods. The first experiment shows the advantages of SLRM and DLRM over traditional approaches in solving Newton equations with the same structure as barycenter problems. 
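Both SLRM and DLRM boil down to Sherman-Morrison-Woodbury steps against small or low-rank blocks. The sketch below (numpy assumed; random SPD blocks stand in for the $A_{ii}$ and $Y$ of Lemma 4.2, and $k$ plays the role of $m-1$) solves $(A_1 + (1_N 1_N^\top)\otimes Y)x = g$ in $O(Nk^3)$ flops and checks the result against a dense solve; Algorithm 2 implements the same idea with a slightly different elimination matrix $U$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 4, 5                      # illustrative sizes

# SPD stand-ins for the diagonal blocks A_11, ..., A_NN and for Y
blocks = []
for _ in range(N):
    G = rng.standard_normal((k, k))
    blocks.append(G @ G.T + k * np.eye(k))
G = rng.standard_normal((k, k))
Y = G @ G.T + k * np.eye(k)

A1 = np.zeros((N * k, N * k))
for i, Bi in enumerate(blocks):
    A1[i * k:(i + 1) * k, i * k:(i + 1) * k] = Bi
A2 = np.kron(np.ones((N, N)), Y)   # A2 = (1_N 1_N^T) x Y = U Y U^T with U = 1_N x I_k
g = rng.standard_normal(N * k)


def woodbury_solve(blocks, Y, g):
    """Solve (A1 + U Y U^T) x = g blockwise via Sherman-Morrison-Woodbury:
    x = A1^{-1} g - A1^{-1} U (Y^{-1} + U^T A1^{-1} U)^{-1} U^T A1^{-1} g,
    where U^T A1^{-1} U = sum_i A_ii^{-1} (the k x k capacitance term)."""
    k = Y.shape[0]
    z = np.concatenate([np.linalg.solve(Bi, g[i * k:(i + 1) * k])
                        for i, Bi in enumerate(blocks)])
    S = np.linalg.inv(Y) + sum(np.linalg.inv(Bi) for Bi in blocks)
    t = np.linalg.solve(S, z.reshape(len(blocks), k).sum(axis=0))  # U^T z collapsed
    corr = np.concatenate([np.linalg.solve(Bi, t) for Bi in blocks])
    return z - corr


x_fast = woodbury_solve(blocks, Y, g)
x_ref = np.linalg.solve(A1 + A2, g)
```

DLRM goes one level deeper: each $A_{ii}^{-1}$ in the capacitance sum is itself expanded by (12), so only the small $m_i \times m_i$ inverse is ever formed.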
The second experiment fully demonstrates the merits of MAAIPM: high speed and accuracy and more efficient memory usage. In the last experiment, on real benchmark data, MAAIPM recovers the images better than any other approach implemented. Across the experiments, we compare our methods with state-of-the-art commercial solvers (MATLAB, Gurobi, MOSEK), the iterative Bregman projection (IBP) of [9], and Bregman ADMM (BADMM) [51, 54]. The results also illustrate MAAIPM's superiority over symmetric Gauss-Seidel ADMM (sGS-ADMM) [53].\n\nAll experiments are run in Matlab R2018b on a workstation with two processors, Intel(R) Xeon(R) Processor E5-2630@2.40GHz (8 cores and 16 threads per processor), and 64GB of RAM, equipped with 64-bit Windows 10. Full experimental details are available in the supplement.\n\nExperiments on solving the normal equations: From Figure 2, one can see that both SLRM and DLRM clearly outperform the Matlab solver in all cases. In computation time, SLRM grows linearly with respect to $N$ and $m'$, and DLRM grows linearly with respect to $N$ and $m$, which matches the conclusions of Theorem 4.3. In practice, we select SLRM when $m^2 \le 4\sum_{t=1}^N m_t^2$ and DLRM when $m^2 > 4\sum_{t=1}^N m_t^2$.\n\nExperiments on barycenter problems: In this experiment, we set $d = 3$ for convenience. For $P^{(t)}$, each entry of $(q_1^{(t)}, \ldots, q_{m'}^{(t)})$ is generated from the i.i.d. standard Gaussian distribution. The entries of the weight vectors $(a_1^{(t)}, \ldots, a_{m'}^{(t)})$ are simulated from the uniform distribution on $(0, 1)$ and then normalized. Next we apply the k-means method2 to choose $m$ points to be the support points. 
Note that Gurobi and MOSEK use a crossover strategy when close to the exact solution to ensure a highly accurate solution, so we can regard Gurobi's objective value $F_{gu}$ as the exact optimal value of the linear program (4). Let \"normalized obj\" denote the normalized objective value $|F_{method} - F_{gu}|/F_{gu}$, where $F_{method}$ is the objective value obtained by each method.\n\nFigure 2: Average computation time over 200 independent trials in solving the linear system. Entries of the diagonal $D$ and of $f$ are generated from the uniform distribution on $(0, 1)$. In the base situation, $N = 50$, $m = 50$, $m' = 25$. Sub-figures show the computation times when rescaling $N$, $m$ and $m_1 = \cdots = m_N = m'$ by $\alpha_N$, $\alpha_m$ and $\alpha_{m'}$ times, respectively.\n\n2We call the Matlab function \"kmeans\" in the Statistics and Machine Learning Toolbox.\n\nLet \"feasibility error\" denote 
For MAAIPM, we termi-\n\u03bbk| +\nnate it when (b\n|c(cid:62)xk|) is less than 5 \u00d7 10\u22125. For sGS-ADMM, we compare with it indirectly by the bench-\nmark claimed in their paper [53]: commercial solver Gurobi 8.1.0 [24] (academic license) with the\ndefault parameter settings. We also compare with another commercial solver MOSEK 9.1.0(aca-\ndemic license). In our observation, MAAIPM can frequently perform better than other popular\ncommercial solvers. We use the default parameter setting(optimal for most cases) for Gurobi\nand MOSEK so that they can exploit multiple processors (16 threads) while other methods are\n(cid:1) 1\nimplemented with only one thread3. For BADMM, we follow the algorithm 4 in [54] to im-\nplement and terminate when (cid:107)\u03a0(k,1) \u2212 \u03a0(k,2)(cid:107)F /(1 + (cid:107)\u03a0(k,1)(cid:107)F + (cid:107)\u03a0(k,2)(cid:107)F ) < 10\u22125. Set\n2 . For IBP, we follow the remark 3 in [9] to implement the\nk } \u2212 {u(n\u22121)\n}(cid:107)F ) < 10\u22128 and\nmethod, terminate it when (cid:107){u(n)\nk }(cid:107)F + (cid:107){v(n\u22121)\n}(cid:107)F ) < 10\u22128, and choose the regularization\n(cid:107){v(n)\nparameter \u0001 from {0.1, 0.01, 0.001} in our experiments. For BADMM and IBP, we implement the\nMatlab codes4 by J.Ye et al. [54] and set the maximum iterate number respectively 4000 and 105.\n\nFigure 3: Performance of methods in pre-speci\ufb01ed\nsupport cases. 
N = m = 50 and m1 = \u00b7\u00b7\u00b7 =\nmN = 50\n\n(cid:107){At}(cid:107)F = (cid:0)(cid:80)N\n\nt=1 (cid:107)At(cid:107)2\n}(cid:107)F /(1 + (cid:107){v(n)\n\n}(cid:107)F /(1 + (cid:107){u(n)\n\nk }(cid:107)F + (cid:107){u(n\u22121)\n\nk } \u2212 {v(n\u22121)\n\nk\n\n(cid:62)\n\n\u03bbk \u2212 c(cid:62)xk)/(1 + |b\n\n(cid:62)\n\nF\n\nk\n\nk\n\nk\n\nFigure 4: The left 8 \ufb01gures are the average computation time, normalized objective value and\nfeasibility error of Gurobi, MOSEK, MAAIPM, BADMM and IBP(\u0001 = 0.1, 0.01, 0.001) in pre-\nspeci\ufb01ed support cases from 30 independent trials. In the \ufb01rst row, m = 100, mt follows an uniform\ndistribution on (75, 125). In the second row, N = 50, m = 100 and m1 = \u00b7\u00b7\u00b7 = mN = m(cid:48). The\nright \ufb01gure is the average computation time of Gurobi and MAAIPM in pre-speci\ufb01ed support cases\nfrom 10 independent trials. mt follows a uniform distribution on (150, 250), and m = 200.\nFrom the left 8 sub-\ufb01gures in \ufb01gure 4 one can observe that MAAIPM returns a considerably accurate\nsolution in the second shortest computation time. For IBP, although it returns an objective value in\nthe shortest time when \u0001 = 0.1, the quality of the solution is almost the worst. 
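The stopping rules described above can be sketched as follows. This is a minimal Python illustration under our own naming conventions, not the released Matlab codes; the IBP check treats the scaling vectors as collections {u_k}, {v_k} with the stacked Frobenius norm:

```python
import numpy as np

def ipm_converged(b, lam, c, x, tol=5e-5):
    """MAAIPM stopping rule: relative duality gap below tol."""
    gap = b @ lam - c @ x
    return gap / (1 + abs(b @ lam) + abs(c @ x)) < tol

def badmm_converged(pi1, pi2, tol=1e-5):
    """BADMM stopping rule: relative difference between the two plan copies."""
    d = np.linalg.norm(pi1 - pi2)
    return d / (1 + np.linalg.norm(pi1) + np.linalg.norm(pi2)) < tol

def ibp_converged(u, u_prev, v, v_prev, tol=1e-8):
    """IBP stopping rule: relative change of both scaling-vector collections."""
    snorm = lambda vs: np.sqrt(sum(np.linalg.norm(z) ** 2 for z in vs))
    du = snorm([a - b for a, b in zip(u, u_prev)]) / (1 + snorm(u) + snorm(u_prev))
    dv = snorm([a - b for a, b in zip(v, v_prev)]) / (1 + snorm(v) + snorm(v_prev))
    return du < tol and dv < tol
```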
Because IBP only solves an approximate problem, if ε is set smaller, the computation time sharply increases but the quality of the solution is still not ensured. For BADMM, it gives a solution close to the exact one, but requires much more computation time. For Gurobi and MOSEK, although they can exploit 16 threads, their computation time is far more than that of MAAIPM. That is to say, MAAIPM also largely outperforms sGS-ADMM in speed, according to Tables 1, 2 and 3 in [53]. Moreover, because the number of iterations remains almost independent of the problem size, the main computational cost of MAAIPM is approximately linear in N and m′. In fact, when N = 5000, MAAIPM requires only 3098.23 seconds, while MOSEK uses over 20000 seconds. Although the memory usage of MAAIPM is within a constant multiple of that of IBP, the former is usually larger than the latter. But the right sub-figure in Figure 4 and the case of N = 5000 demonstrate that MAAIPM's memory usage is managed more efficiently compared to Gurobi and MOSEK.

Figure 5: Computation time and normalized objective value of MAAIPM, BADMM and IBP in the free support cases from 30 independent trials. "Normalized obj" denotes Fmethod/FMAAIPM − 1, where Fmethod is the objective value obtained by each method. N takes different values and m = m′ = 50.

3We call the Matlab function "maxNumCompThreads(1)"
4Available in https://github.com/bobye/WBC_Matlab
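To make the accuracy/speed role of the regularization parameter ε in IBP concrete, here is a minimal fixed-support sketch of an iterative-Bregman-projection barycenter in the spirit of [9]. It is our own illustration (the function name, interface and toy setup are assumptions), not the Matlab code of [54]; smaller ε yields a less blurred barycenter but slows convergence and risks numerical underflow in the Gibbs kernels:

```python
import numpy as np

def ibp_barycenter(cost_mats, margs, eps=0.01, weights=None, n_iter=500):
    """Entropy-regularized barycenter of N measures on a common m-point support.

    cost_mats : list of N cost matrices, cost_mats[t] of shape (m, m_t)
    margs     : list of N marginals, margs[t] of shape (m_t,)
    eps       : entropic regularization parameter
    """
    N, m = len(margs), cost_mats[0].shape[0]
    weights = np.full(N, 1.0 / N) if weights is None else weights
    Ks = [np.exp(-C / eps) for C in cost_mats]      # Gibbs kernels
    us = [np.ones(m) for _ in range(N)]             # scaling vectors
    for _ in range(n_iter):
        # project each plan onto its target-marginal constraint
        vs = [a / (K.T @ u) for K, u, a in zip(Ks, us, margs)]
        # barycenter = weighted geometric mean of the intermediate marginals
        w = np.exp(sum(lam * np.log(K @ v) for lam, K, v in zip(weights, Ks, vs)))
        # re-project onto the common barycenter marginal
        us = [w / (K @ v) for K, v in zip(Ks, vs)]
    return w
```

With two identical input measures (and identical costs) the returned barycenter coincides with that measure up to the O(ε) entropic blur, which is exactly the approximation error the experiments above measure.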
These positive traits are consistent with the time and memory complexity proved in Theorem 4.3.
Next, we conduct numerical studies to test MAAIPM in the free support cases, i.e., problem (3). Same as [54], we implement the versions of BADMM and IBP that can automatically update support points, and set the initial support points according to a multivariate normal distribution. We set the maximum numbers of iterations in BADMM and IBP as 104 and 106. The entries of (q1(t), . . . , qm′(t)) are generated with the i.i.d. uniform distribution on (0, 100), and the initial support points follow a Gaussian distribution. In Figure 6, "Normalized obj" denotes Fmethod/FMAAIPM − 1, where Fmethod is the objective value obtained by each iteration of the methods. From Figures 5 and 6, one can see that, in the free support cases, MAAIPM can still obtain the smallest objective value in the second shortest time. That is because MAAIPM updates the support more frequently and adopts "jump" tricks to avoid local minima. Although IBP can obtain an approximate value in the shortest time when ε = 0.1, the quality of the barycenter is too low to be useful.

Figure 6: Performance of methods in free support cases. N = 40, m = m1 = m2 = ··· = mN = 50.

Table 1: Experiments on datasets (computation times of MAAIPM, BADMM and IBP(ε = 0.01) on MNIST and Fashion-MNIST).

Experiments on real applications: We conduct experiments similar to [16, 53] on the MNIST4 and Fashion-MNIST4 datasets. In MNIST, we randomly select 200 images of digit 8 and resize each image to 0.5, 1 and 2 times its original size 28 × 28. In Fashion-MNIST, we randomly select 20 images of handbags, and resize each image to 0.5 and 1 times the original size. The support points of the images are dense and different. Next, for each case, we apply MAAIPM, BADMM and IBP (ε = 0.01) to compute the Wasserstein barycenter in respectively the free support cases and the pre-specified support cases.
From Table 1, one can see that MAAIPM obtained the clearest and sharpest barycenters within the least computation time.

4Available in http://yann.lecun.com/exdb/mnist/ and https://github.com/zalandoresearch/fashion-mnist

Acknowledgments

We thank Tianyi Lin, Simai He, Bo Jiang, Qi Deng and Yibo Zeng for helpful discussions and fruitful suggestions.

References
[1] S.S. Abadeh, V.A. Nguyen, D. Kuhn, and P.M.M. Esfahani. Wasserstein distributionally robust Kalman filtering. In Advances in Neural Information Processing Systems 31, pages 8483–8492, 2018.
[2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2): 904–924, 2011.
[3] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30, pages 1964–1974, 2017.
[4] P.C. Alvarez-Esteban, E. Barrio, J. Cuesta-Albertos, and C. Matran. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2): 744–762, 2016.
[5] S. Amari, R. Karakida, M. Oizumi, and M. Cuturi. Information geometry for regularized optimal transport and barycenters of patterns. Neural Computation, 31(5): 827–848, 2019.
[6] E. Anderes, S. Borgwardt, and J. Miller. Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2): 389–409, October 2016.
[7] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[8] D. Bertsimas and J.N. Tsitsiklis. Introduction to Linear Optimization.
Athena Scientific, 1997.
[9] J.D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2): A1111–A1138, 2015.
[10] J. Blanchet, A. Jambulapati, C. Kent, and A. Sidford. Towards optimal running times for optimal transport. arXiv:1810.07717.
[11] G. Carlier, A. Oberman, and E. Oudet. Numerical methods for matching for teams and Wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6): 1621–1642, 2015.
[12] S. Claici, E. Chien, and J. Solomon. Stochastic Wasserstein barycenters. arXiv:1802.05757.
[13] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Advances in Neural Information Processing Systems 30, pages 3730–3739, 2017.
[14] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
[15] A. Dessein, N. Papadakis, and C.A. Deledalle. Parameter estimation in finite mixture models by regularized optimal transport: a unified framework for hard and soft clustering. arXiv:1711.04366, 2017.
[16] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In the 31st International Conference on Machine Learning, pages 685–693, 2014.
[17] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In Advances in Neural Information Processing Systems 30, pages 10783–10793, 2018.
[18] P. Dvurechensky, A. Gasnikov, and A. Kroshnin. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. In Proceedings of the 35th International Conference on Machine Learning, 80: 1367–1376, 2018.
[19] C. Frogner, C. Zhang, H.
Mobahi, M. Araya, and T.A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28, pages 2053–2061, 2015.
[20] F.L. Gall and F. Urrutia. Improved rectangular matrix multiplication using powers of the Coppersmith-Winograd tensor. arXiv:1708.05622, 2017.
[21] R. Gao, L. Xie, Y. Xie, and H. Xu. Robust hypothesis testing using Wasserstein uncertainty sets. In Advances in Neural Information Processing Systems 31, pages 7913–7923, 2018.
[22] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems 29, pages 3440–3448, 2016.
[23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A.C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30, pages 5767–5777, 2017.
[24] Gurobi Optimization, Inc. Gurobi Optimizer Reference Manual, 2018.
[25] N. Ho, V. Huynh, D. Phung, and M.I. Jordan. Probabilistic multilevel clustering via composite transportation distance. arXiv:1810.11911, 2018.
[26] N. Ho, X. Nguyen, M. Yurochkin, H.H. Bui, V. Huynh, and D. Phung. Multilevel clustering via Wasserstein means. In International Conference on Machine Learning, pages 1501–1509, 2017.
[27] G. Huang, C. Guo, M.J. Kusner, Y. Sun, F. Sha, and K.Q. Weinberger. Supervised word mover's distance. In Advances in Neural Information Processing Systems 29, pages 4862–4870, 2016.
[28] E. John and E.A. Yıldırım. Implementation of warm-start strategies in interior-point methods for linear programming in fixed dimension. Computational Optimization and Applications, 41(2): 151–183, 2008.
[29] E.A. Yıldırım and S.J. Wright. Warm-start strategies in interior-point methods for linear programming. SIAM Journal on Optimization, 12(3): 782–810, 2002.
[30] M.J. Kusner, Y.
Sun, N.I. Kolkin, and K.Q. Weinberger. From word embeddings to document distances. In the 32nd International Conference on Machine Learning, pages 957–966, 2015.
[31] T. Lacombe, M. Cuturi, and S. Oudot. Large scale computation of means and clusters for persistence diagrams using optimal transport. In Advances in Neural Information Processing Systems 31, pages 9792–9802, 2018.
[32] J. Lee and M. Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems 31, pages 2692–2701, 2018.
[33] T. Lin, N. Ho, and M.I. Jordan. On efficient optimal transport: An analysis of greedy and accelerated mirror descent algorithms. arXiv:1901.06482.
[34] A. Mallasto and A. Feragen. Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. In Advances in Neural Information Processing Systems 30, pages 5660–5670, 2017.
[35] S. Mehrotra. On the implementation of a primal-dual interior-point method. SIAM Journal on Optimization, 2(4): 575–601, 1992.
[36] S. Mizuno, M.J. Todd, and Y. Ye. On adaptive-step primal-dual interior-point algorithms for linear programming. Mathematics of Operations Research, 18(4): 964–981, 1993.
[37] B. Muzellec and M. Cuturi. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In Advances in Neural Information Processing Systems 31, pages 10258–10269, 2018.
[38] X. Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 4(1): 370–400, 2013.
[39] X. Nguyen. Borrowing strength in hierarchical Bayes: posterior concentration of the Dirichlet base measure. Bernoulli, 22(3): 1535–1571, 2016.
[40] J. Nocedal and S.J. Wright. Numerical Optimization, 2006.
[41] G. Peyré, M. Cuturi, and J. Solomon.
Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pages 2664–2672, 2016.
[42] G. Peyré and M. Cuturi. Computational Optimal Transport. arXiv:1803.00567.
[43] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision, volume 6667 of Lecture Notes in Computer Science, pages 435–446. Springer, 2012.
[44] A. Rolet, M. Cuturi, and G. Peyré. Fast dictionary learning with a smoothed Wasserstein loss. In International Conference on Artificial Intelligence and Statistics, pages 630–638, 2016.
[45] A. Skajaa, E.D. Andersen, and Y. Ye. Warmstarting the homogeneous and self-dual interior-point method for linear and conic quadratic problems. Mathematical Programming Computation, 5(1): 1–25, 2013.
[46] J. Solomon, G.D. Goes, G. Peyré, M. Cuturi, A. Butscher, A. Nguyen, D. Tao, and L. Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4): 66, 2015.
[47] J. Solomon, R.M. Rustamov, L. Guibas, and A. Butscher. Wasserstein propagation for semi-supervised learning. In the 31st International Conference on Machine Learning, pages 306–314, 2014.
[48] S. Srivastava, C. Li, and D. Dunson. Scalable Bayes via barycenter in Wasserstein space. Journal of Machine Learning Research, 19(8): 1–35, 2018.
[49] M. Staib, S. Claici, J.M. Solomon, and S. Jegelka. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems 30, pages 2647–2658, 2017.
[50] C. Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.
[51] H. Wang and A. Banerjee. Bregman alternating direction method of multipliers. In Advances in Neural Information Processing Systems 27, 2014, pp.
2816–2824.
[52] S.J. Wright. Primal-Dual Interior-Point Methods, 1997.
[53] L. Yang, J. Li, D. Sun, and K.C. Toh. A fast globally linearly convergent algorithm for the computation of Wasserstein barycenters. arXiv:1809.04249, 2018.
[54] J. Ye, P. Wu, J.Z. Wang, and J. Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9): 2317–2332, 2017.
[55] Y. Ye. On affine scaling algorithms for nonconvex quadratic programming. Mathematical Programming, 56(1-3): 285–300, 1992.
[56] Y. Ye, O. Güler, R.A. Tapia, and Y. Zhang. A quadratically convergent $O(\sqrt{n}L)$-iteration algorithm for linear programming. Mathematical Programming, 59(1): 151–162, 1993.