{"title": "Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 9322, "page_last": 9333, "abstract": "We present a novel algorithm to estimate the barycenter of arbitrary probability distributions with respect to the Sinkhorn divergence. Based on a Frank-Wolfe optimization strategy, our approach proceeds by populating the support of the barycenter incrementally, without requiring any pre-allocation. We consider discrete as well as continuous distributions, proving convergence rates of the proposed algorithm in both settings. Key elements of our analysis are a new result showing that the Sinkhorn divergence on compact domains has Lipschitz continuous gradient with respect to the Total Variation and a characterization of the sample complexity of Sinkhorn potentials. Experiments validate the effectiveness of our method in practice.", "full_text": "Sinkhorn Barycenters with Free Support via\n\nFrank-Wolfe Algorithm\n\nGiulia Luise1, Saverio Salzo2, Massimiliano Pontil1,2, Carlo Ciliberto3\n\ng.luise.16@ucl.ac.uk, saverio.salzo@iit.it, m.pontil@cs.ucl.ac.uk,c.ciliberto@ic.ac.uk\n\n1 Department of Computer Science, University College London, UK\n\n2 CSML, Istituto Italiano di Tecnologia, Genova, Italy\n\n3 Department of Electrical and Electronic Engineering, Imperial College London, UK\n\nAbstract\n\nWe present a novel algorithm to estimate the barycenter of arbitrary probability\ndistributions with respect to the Sinkhorn divergence. Based on a Frank-Wolfe\noptimization strategy, our approach proceeds by populating the support of the\nbarycenter incrementally, without requiring any pre-allocation. We consider dis-\ncrete as well as continuous distributions, proving convergence rates of the proposed\nalgorithm in both settings. 
Key elements of our analysis are a new result show-\ning that the Sinkhorn divergence on compact domains has Lipschitz continuous\ngradient with respect to the Total Variation and a characterization of the sample\ncomplexity of Sinkhorn potentials. Experiments validate the effectiveness of our\nmethod in practice.\n\n1\n\nIntroduction\n\nAggregating and summarizing collections of probability measures is a key task in several machine\nlearning scenarios. Depending on the metric adopted, the properties of the resulting average (or\nbarycenter) of a family of probability measures vary signi\ufb01cantly. By design, optimal transport\nmetrics are better suited at capturing the geometry of the distribution than Euclidean distance or\nf-divergences [14]. In particular, Wasserstein barycenters have been successfully used in settings\nsuch as texture mixing [40], Bayesian inference [49], imaging [26], or model ensemble [18].\nThe notion of barycenter in Wasserstein space was \ufb01rst introduced by [2] and then investigated\nfrom the computational perspective for the original Wasserstein distance [12, 50, 54] as well as its\nentropic regularizations (e.g. Sinkhorn) [6, 14, 20]. Two main challenges in this regard are: i) how to\nef\ufb01ciently identify the support of the candidate barycenter and ii) how to deal with continuous (or\nin\ufb01nitely supported) probability measures. The \ufb01rst problem is typically addressed by either \ufb01xing\nthe support of the barycenter a-priori [20, 50] or by adopting an alternating minimization procedure\nto iteratively optimize the support point locations and their weights [12, 14]. While \ufb01xed-support\nmethods enjoy better theoretical guarantees, free-support algorithms are more memory ef\ufb01cient and\npracticable in high dimensional settings. 
The problem of dealing with continuous distributions has\nbeen mainly approached by adopting stochastic optimization methods to minimize the barycenter\nfunctional [12, 20, 50].\nIn this work we propose a novel method to compute the barycenter of a set of probability distributions\nwith respect to the Sinkhorn divergence [25] that does not require to \ufb01x the support beforehand.\nWe address both the cases of discrete and continuous probability measures. In contrast to previous\nfree-support methods, our algorithm does not perform an alternate minimization between support and\nweights. Instead, we adopt a Frank-Wolfe (FW) procedure to populate the support by incrementally\nadding new points and updating their weights at each iteration, similarly to kernel herding strategies\n[5]. We prove the convergence of the proposed optimization scheme for both \ufb01nitely and in\ufb01nitely\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fsupported distribution settings. A central result to our analysis is the characterization of regularity\nproperties of Sinkhorn potentials (i.e., the dual solutions of the Sinkhorn divergence problem), which\nextends recent work in [21, 23]. We empirically evaluate the performance of the proposed algorithm.\nContributions. The analysis of the proposed algorithm hinges on the following contributions: i) we\nshow that the gradient of the Sinkhorn divergence is Lipschitz continuous on the space of probability\nmeasures with respect to the Total Variation. This grants us convergence of the barycenter algorithm\nin \ufb01nite settings. ii) We characterize the sample complexity of Sinkhorn potentials of two empirical\ndistributions sampled from arbitrary probability measures. 
This latter result is interesting on its\nown but it also enables us to iii) design a concrete optimization scheme to approximately solve the\nbarycenter problem for arbitrary probability measures with convergence guarantees. iv) A byproduct\nof our analysis is the generalization of the FW algorithm to settings where the objective functional\nis de\ufb01ned only on a set with empty interior, which is the case for Sinkhorn divergence barycenter\nproblem.\nThe rest of the paper is organized as follows: Sec. 2 reviews standard notions of optimal transport\ntheory. Sec. 3 introduces the barycenter functional, and analyses the Lipschitz continuity of its\ngradient. Sec. 4 describes the implementation of our algorithm and Sec. 5 studies its convergence\nrates. Finally, Sec. 6 evaluates the proposed methods empirically and Sec. 7 provides concluding\nremarks.\n\n2 Background\n\nThe aim of this section is to recall de\ufb01nitions and properties of Optimal Transport theory with entropic\nregularization. Throughout the work, we consider a compact set X\u21e2 Rd and a symmetric cost\nfunction c : X\u21e5X! R. We set D := supx,y2X c(x, y) and denote by M+\n1 (X ) the space of\nprobability measures on X (positive Radon measures with mass 1). For any \u21b5,  2M +\n1 (X ), the\nOptimal Transport problem with entropic regularization is de\ufb01ned as follow [13, 24, 38]\n\nOT\"(\u21b5, ) = min\n\n\u21e12\u21e7(\u21b5,) ZX 2\n\nc(x, y) d\u21e1(x, y) + \"KL(\u21e1|\u21b5 \u2326 ),\"\n\n 0\n\n(1)\n\nwhere KL(\u21e1|\u21b5 \u2326 ) is the Kullback-Leibler divergence between the candidate transport plan \u21e1 and\nthe product distribution \u21b5 \u2326 , and \u21e7(\u21b5, ) = {\u21e1 2M 1\n+(X 2) : P1#\u21e1 = \u21b5, P2#\u21e1 = }, with\nthe projector onto the i-th component and # the push-forward operator. The case\nPi : X\u21e5X!X\n\" = 0 corresponds to the classic Optimal Transport problem introduced by Kantorovich [29]. 
In\nparticular, if c = k\u00b7  \u00b7kp for p 2 [1,1), then OT0 is the well-known p-Wasserstein distance [52].\nLet \"> 0. Then, the dual problem of (1), in the sense of Fenchel-Rockafellar, is (see [10, 21])\n\nOT\"(\u21b5, ) = max\n\nu,v2C(X )Z u(x) d\u21b5(x) +Z v(y) d(y)  \"Z e\n\nu(x)+v(y)c(x,y)\n\n\"\n\nd\u21b5(x)d(y),\n\n(2)\n\nwhere C(X ) denotes the space of real-valued continuous functions on X , endowed with k\u00b7k1. Let\n\u00b5 2M +\n\n1 (X ). We denote by T\u00b5 : C(X ) !C (X ) the map such that, for any w 2C (X ),\n\nT\u00b5(w) : x 7! \" logZ e\n\nw(y)c(x,y)\n\n\"\n\nd\u00b5(y).\n\n(3)\n\nThe \ufb01rst order optimality conditions for (2) are (see [21] or Appendix B.2)\n\nu = T(v) \u21b5- a.e.\n\nand\n\nv = T\u21b5(u) - a.e.\n\n(4)\nPairs (u, v) satisfying (4) exist [30] and are referred to as Sinkhorn potentials. They are unique (\u21b5, )\n- a.e. up to an additive constant, i.e., (u + t, v  t) is also a solution for any t 2 R. In line with\n[21, 23] it will be useful in the following to assume (u, v) to be the Sinkhorn potentials such that: i)\nu(xo) = 0 for an arbitrary anchor point xo 2X and ii) (4) is satis\ufb01ed pointwise on the entire domain\nX . Then, u is a \ufb01xed point of the map T\u21b5 = T  T\u21b5 (analogously for v). This suggests a \ufb01xed\npoint iteration approach to minimize (2), yielding the well-known Sinkhorn-Knopp algorithm which\nhas been shown to converge linearly in C(X ) [30, 41]. See also Thm. B.10 for a precise statement.\nWe recall a key result characterizing the differentiability of OT\" in terms of the Sinkhorn potentials\nthat will be useful in the following.\n\n2\n\n\fProposition 1 (Prop 2 in [21]). 
Let ∇OT_ε : M_+^1(X)^2 → C(X)^2 be such that, for all α, β ∈ M_+^1(X),\n\n∇OT_ε(α, β) = (u, v), with u = T_β(v), v = T_α(u) on X, u(x_o) = 0. (5)\n\nThen, OT_ε is directionally differentiable and, for all α, α′, β, β′ ∈ M_+^1(X), the directional derivative of OT_ε at (α, β) along the feasible direction (μ, ν) = (α′ − α, β′ − β) is\n\nOT′_ε(α, β; μ, ν) = ⟨∇OT_ε(α, β), (μ, ν)⟩ = ⟨u, μ⟩ + ⟨v, ν⟩, (6)\n\nwhere ⟨w, ρ⟩ = ∫ w(x) dρ(x) denotes the canonical pairing between the spaces C(X) and M(X).\nNote that ∇OT_ε is not a gradient in the standard sense. In particular, the directional derivative in (6) is not defined for arbitrary pairs of signed measures, but only along feasible directions (α′ − α, β′ − β).\nSinkhorn Divergence. The fast convergence of the Sinkhorn-Knopp algorithm makes OT_ε (with ε > 0) preferable to OT_0 from a computational perspective [13]. However, when ε > 0 the entropic regularization introduces a bias in the optimal transport problem, since in general OT_ε(μ, μ) ≠ 0. To compensate for this bias, [25] introduced the Sinkhorn divergence\n\nS_ε : M_+^1(X) × M_+^1(X) → R, (α, β) ↦ OT_ε(α, β) − (1/2) OT_ε(α, α) − (1/2) OT_ε(β, β), (7)\n\nwhich was shown in [21] to be nonnegative, biconvex and to metrize the convergence in law under mild assumptions. We characterize the gradient of S_ε(·, β) for a fixed β ∈ M_+^1(X), which will be key to derive our optimization algorithm for computing Sinkhorn barycenters.\nRemark 2. Let ∇_1 OT_ε : M_+^1(X)^2 → C(X) denote the first component of ∇OT_ε (informally, the component u of the Sinkhorn potentials (u, v)). Then, it follows from Prop. 1 and the definition of the Sinkhorn divergence (7) that for any β ∈ M_+^1(X) the function S_ε(·, β) : M_+^1(X) → 
R is\ndirectionally differentiable and admits gradient\n\n1 (X ) the function S\"(\u00b7, ) : M+\n\nr[S\"(\u00b7, )] : M+\n\n1 (X ) !C (X )\n\n\u21b5 7! r1OT\"(\u21b5, ) \n\n1\n2r1OT\"(\u21b5, \u21b5) = u  p,\n\n(8)\n\nwith u = T\u21b5(u) and p = T\u21b5\u21b5(p) the Sinkhorn potentials of OT\"(\u21b5, ) and OT\"(\u21b5, \u21b5) respectively\nwhich are zero at xo.\n\nWe refer to Appendix C for an in-depth analysis of the directional differentiability properties of the\nSinkorn divergence.\n\n3 Sinkhorn barycenters with Frank-Wolfe\n\nGiven 1, . . . m 2M +\ngoal of this paper is to solve the following Sinkhorn barycenter problem\n\n1 (X ) and !1, . . . ,! m  0 a set of weights such thatPm\n\nj=1 !j = 1, the main\n\n(9)\n\nwith\n\n1 (X )\n\nB\"(\u21b5),\n\nB\"(\u21b5) =\n\n!j S\"(\u21b5, j).\n\nmin\n\u21b52M+\n\nmXj=1\n+(X ) has empty interior in the space of\nAlthough the objective functional B\" is convex, its domain M1\n\ufb01nite signed measure M(X ). Hence standard notions of Fr\u00e9chet or G\u00e2teaux differentiability do not\napply. This, in principle causes some dif\ufb01culties in devising optimization methods. To circumvent\nthis issue, in this work we adopt the Frank-Wolfe (FW) algorithm. Indeed, one key advantage of\nthis method is that it is formulated in terms of directional derivatives along feasible directions (i.e.,\ndirections that locally remain inside the constraint set). Building upon [15, 16, 19], which study the\nalgorithm in Banach spaces, we show that the \u201cweak\u201d notion of directional differentiability of S\"\n(and hence of B\") in Remark 2 is suf\ufb01cient to carry out the convergence analysis. While full details\nare provided in Appendix A, below we give an overview of the main result.\nFrank-Wolfe in dual Banach spaces. Let W be a real Banach space with topological dual W\u21e4\nand let D\u21e2W \u21e4 be a nonempty, convex, closed and bounded set. 
For any w 2W \u21e4 denote by\nFD(w) = R+(D w) the set of feasible direction of D at w (namely s = t(w0  w) with w0 2D\nand t > 0). Let G : D! R be a convex function and assume that there exists a map rG : D!W\n(not necessarily unique) such that hrG(w), si = G0(w; s) for every s 2F D(w). In Alg. 1 we present\n\n3\n\n\fAlgorithm 1 FRANK-WOLFE IN DUAL BANACH SPACES\nInput: initial w0 2D , precision (k)k2N 2 RN\nFor k = 0, 1, . . .\n\n++, such that k(k + 2) is nondecreasing.\n\nTake zk+1 such that G0(wk, zk+1  wk) \uf8ff minz2D G0(wk, z  wk) + k\nwk+1 = wk + 2\n\n2\n\nk+2 (zk+1  wk)\n\na method to minimize G. The algorithm is structurally equivalent to the standard FW [19, 27] and\naccounts for possible inaccuracies when computing the conditional gradient (i.e. solving the FW\ninner minimization). This will be key in Sec. 5 when studying the barycenter problem for j with\nin\ufb01nite support. The following result (see proof in Appendix A) shows that under the additional\nassumption that rG is Lipschitz-continuous and with suf\ufb01ciently fast decay of the errors, the above\nprocedure converges in value to the minimum of G with rate O(1/k). Here diam(D) denotes the\ndiameter of D with respect to the dual norm.\nTheorem 3. Under the assumptions above, suppose in addition that rG is L-Lipschitz continuous\nwith L > 0. Let (wk)k2N and (k)k2N be de\ufb01ned according to Alg. 1. Then, for every integer k  1,\n(10)\n\n2\n\nG(wk)  min\nw2D\n\nG(w) \uf8ff\n\nL diam(D)2 + k.\n\nk + 2\n\nFrank-Wolfe Sinkhorn barycenters. We show that the barycenter problem (9) satis\ufb01es the setting\nand hypotheses of Thm. 3 and can be thus approached via Alg. 1.\nOptimization domain. Let W = C(X ), with dual W\u21e4 = M(X ). The constraint set D = M+\nconvex, closed, and bounded.\nObjective functional. The objective functional G = B\" : M+\nsince it is a convex combination of S\"(\u00b7, j), with j = 1 . . . m. The gradient rB\" : M+\nis rB\" =Pm\nLipschitz continuity of the gradient. 
This is the most critical condition and it is studied in the following\ntheorem.\nTheorem 4. The gradient rOT\" de\ufb01ned in Prop. 1 is Lipschitz continuous. In particular, the \ufb01rst\ncomponent r1OT\" is 2\"e3D/\"-Lipschitz continuous, i.e., for every \u21b5, \u21b50,, 0 2M +\n\n1 (X ) is\n1 (X ) ! R, de\ufb01ned in (9), is convex\n1 (X ) !C (X )\n\nj=1 !j rS\"(\u00b7, j), where rS\"(\u00b7, j) is given in Remark 2.\n\n1 (X ),\n\nku  u0k1 = kr1OT\"(\u21b5, )  r1OT\"(\u21b50, 0)k1 \uf8ff 2\"e3D/\" (k\u21b5  \u21b50kT V + k  0kT V ),\n\n(11)\nwhere D = supx,y2X c(x, y), u = T\u21b5(u), u0 = T0,\u21b50(u0), and u(xo) = u0(xo) = 0. Moreover, it\nfollows from (8) that rS\"(\u00b7, ) is 6\"e3D/\"-Lipschitz continuous. The same holds for rB\".\nThm. 4 is one of the main contributions of this paper. It can be rephrased by saying that the operator\nthat maps a pair of distributions to their Sinkhorn potentials is Lipschitz continuous. This result is\nsigni\ufb01cantly deeper than the one given in [20, Lemma 1], which establishes the Lipschitz continuity\nof the gradient in the semidiscrete case. The proof (given in Appendix D) relies on non-trivial tools\nfrom Perron-Frobenius theory for Hilbert\u2019s metric [32], which is a well-established framework to\nstudy Sinkhorn potentials [38]. We believe this result is interesting not only for the application of\nFW to the Sinkhorn barycenter problem, but also for further understanding regularity properties of\nentropic optimal transport.\n\n4 Algorithm: practical Sinkhorn barycenters\n\nAccording to Sec. 3, FW is a valid approach to tackle the barycenter problem (9). Here we describe\nhow to implement in practice the abstract procedure of Alg. 1 to obtain a sequence of distributions\n(\u21b5k)k2N minimizing B\". A main challenge in this sense resides in \ufb01nding a minimizing feasible\ndirection for B0\"(\u21b5k; \u00b5  \u21b5k) = hrB\"(\u21b5k), \u00b5  \u21b5ki. 
According to Remark 2, this amounts to solve\n(12)\n\nwhere\n\nujk  pk = rS\"[(\u00b7, j)](\u21b5k),\n\n!j hujk  pk, \u00b5i\n\n\u00b5k+1 2 argmin\n1 (X )\n\n\u00b52M+\n\nmXj=1\n\n4\n\n\fAlgorithm 2 SINKHORN BARYCENTER\n\nInput: j = (Yj, bj) with Yj 2 Rd\u21e5nj , bj 2 Rnj ,! j > 0 for j = 1, . . . , m, x0 2 Rd, \"> 0, K 2 N.\nInitialize: \u21b50 = (X0, a0) with X0 = x0, a0 = 1.\nFor k = 0, 1, . . . , K  1\n\np = SINKHORNKNOPP(\u21b5k,\u21b5 k,\" )\np(\u00b7) = SINKHORNGRADIENT(Xk, ak, p)\nFor j = 1, . . . m\nvj = SINKHORNKNOPP(\u21b5k, j,\" )\nuj(\u00b7) = SINKHORNGRADIENT(Yj, bj, vj)\n\nLet ' : x 7!Pm\n\nj=1 !j uj(x)  p(x)\nxk+1 = MINIMIZE(')\nXk+1 = [Xk, xk+1] and ak+1 = 1\n\u21b5k+1 = (Xk+1, ak+1)\n\nk+2 [k ak, 2]\n\nReturn: \u21b5K\n\nwith pk = r1OT\"(\u21b5k,\u21b5 k) not depending on j. In general (12) would entail a minimization over\nthe set of all probability distributions on X . However, since the objective functional is linear in \u00b5\nand M+\n1 (X ) is a weakly-\u21e4 compact convex set, we can apply Bauer maximum principle (see e.g.,\n[3, Thm. 7.69]). Hence, solutions are achieved at the extreme points of the optimization domain.\nThese correspond to Dirac\u2019s deltas in the case of M+\n1 (X )\nthe Dirac\u2019s delta centered at x 2X . We have hw, xi = w(x) for every w 2C (X ). Hence (12) is\nequivalent to\n\n1 (X ) [11, p. 108]. Denote by x 2M +\n\n\u00b5k+1 = xk+1\n\nwith\n\nxk+1 2 argmin\nx2X\n\nmXj=1\n\n!jujk(x)  pk(x).\n\n(13)\n\nOnce the new support point xk+1 has been obtained, the update in Alg. 1 corresponds to\n\n2\n\nk + 2\n\nk\n\nk + 2\n\n2\n\nk + 2\n\n\u21b5k +\n\nxk+1.\n\n\u21b5k+1 = \u21b5k +\n\n(xk+1  \u21b5k) =\n\n(14)\nIf FW is initialized with a Dirac\u2019s delta \u21b50 = x0 for some x0 2X , then every further iterate \u21b5k\nwill have at most k + 1 support points. According to (13), the inner optimization for FW consists in\nj=1 !jujk(x)  pk(x) over X . 
In practice, having access to\nminimizing the functional x 7!Pm\n\nsuch functional poses already a challenge, since it requires computing the Sinkhorn potentials ujk\nand pk, which are in\ufb01nite dimensional objects. Below we discuss how to estimate these potentials\nwhen the j have \ufb01nite support. We then address the general setting.\nComputing r1OT\" for probability distributions with \ufb01nite support. Let \u21b5,  2M +\n1 (X ), where\n =Pn\ni=1 nonnegative weights summing up to 1. It is useful to identify \nwith the pair (Y, b), where Y 2 Rd\u21e5n is the matrix with i-th column equal to yi. Let (u, v) 2C (X )2\nbe the pair of Sinkhorn potentials associated to \u21b5 and  in Prop. 1, recall that u = T(v). Denote by\nv 2 Rn the evaluation vector of the Sinkhorn potential v, with i-th entry vi = v(yi). According to\nthe de\ufb01nition of T in (3), for any x 2X\n\ni=1 biyi and b = (bi)n\n\n[r1OT\"(\u21b5, )](x) = u(x) = [T(v)](x) = \" log\n\ne(vic(x,yi))/\" bi,\n\n(15)\n\nnXi=1\n\nsince the integral T(v) reduces to a sum over the support of . Hence, the gradient of OT\" (i.e.\nthe potential u), is uniquely characterized in terms of the \ufb01nite dimensional vector v collecting the\nvalues of the potential v on the support of  . We refer as SINKHORNGRADIENT to the routine which\n\nassociates to each triplet (Y, b, v) the map x 7! \" logPn\nSinkhorn barycenters: \ufb01nite case. Alg. 2 summarizes FW applied to the barycenter problem (9)\nwhen the j\u2019s have \ufb01nite support. Starting from a Dirac\u2019s delta \u21b50 = x0, at each iteration k 2 N the\nalgorithm proceeds by: i) \ufb01nding the corresponding evaluation vectors vj\u2019s and p of the Sinkhorn\npotentials for OT\"(\u21b5k, j) and OT\"(\u21b5k,\u21b5 k) respectively, via the routine SINKHORNKNOPP (see\n[13, 21] or Alg. B.2). 
This is possible since both j and \u21b5k have \ufb01nite support and therefore the\n\ni=1 e(vic(x,yi))/\" bi.\n\n5\n\n\fproblem of approximating the evaluation vectors vj and p reduces to an optimization problem over\n\ufb01nite vector spaces that can be ef\ufb01ciently solved [13]; ii) obtain the gradients uj = r1OT\"(\u21b5k, j)\nand p = r1OT\"(\u21b5k,\u21b5 k) via SINKHORNGRADIENT; iii) minimize ' : x 7!Pn\nj=1 !j uj(x)  p(x)\nover X to \ufb01nd a new point xk+1 (we comment on this meta-routine MINIMIZE below); iv) \ufb01nally\nupdate the support and weights of \u21b5k according to (14) to obtain the new iterate \u21b5k+1.\nA key feature of Alg. 2 is that the support of the candidate barycenter is updated incrementally\nby adding at most one point at each iteration, a procedure similar in \ufb02avor to the kernel herding\nstrategy in [5, 31] and conditional gradient for sparse inverse problem [8, 9]. This contrasts with\nprevious methods for barycenter estimation [6, 14, 20, 50], which require the support set, or at least\nits cardinality, to be \ufb01xed beforehand. However, indentifying the new support point requires solving\nthe nonconvex problem (13), a task addressed by the meta-routine MINIMIZE. This problem is\ntypically smooth (e.g., a linear combination of Gaussians when c(x, y) = kx  yk2) and \ufb01rst or\nsecond order nonlinear optimization methods can be adopted to \ufb01nd stationary points. We note that\nall free-support methods in the literature for barycenter estimation are also affected by nonconvexity\nsince they typically require solving a biconvex problem (alternating minimization between support\npoints and weights) which is not jointly convex [12, 14]. We conclude by observing that if we restrict\nto the setting of [20, 50] with \ufb01xed \ufb01nite support set, then MINIMIZE can be solved exactly by\nevaluating the functional in (13) on each candidate support point.\nSinkhorn barycenters: general case. 
When the β_j's have infinite support, it is not possible to apply Sinkhorn-Knopp in practice. In line with [23, 50], we can randomly sample empirical distributions β̂_j = (1/n) Σ_{i=1}^n δ_{x_ij} from each β_j and apply Sinkhorn-Knopp to (α_k, β̂_j) in Alg. 1 rather than to the ideal pair (α_k, β_j). This strategy is motivated by [21, Prop 13], where it was shown that Sinkhorn potentials vary continuously with the input measures. However, it opens two questions: i) whether this approach is theoretically justified (consistency) and ii) how many points we should sample from each β_j to ensure convergence (rates). We answer these questions in Thm. 7 in the next section.\n\n5 Convergence analysis\n\nWe finally address the convergence of FW applied to both the finite and infinite settings discussed in Sec. 4. We begin by considering the finite setting.\nTheorem 5. Suppose that β_1, . . . , β_m ∈ M_+^1(X) have finite support and let α_k be the k-th iterate of Alg. 2 applied to (9). Then,\n\nB_ε(α_k) − min_{α ∈ M_+^1(X)} B_ε(α) ≤ 48 ε e^{3D/ε} / (k + 2). (16)\n\nThe result follows from the convergence of FW in Thm. 3 applied with the Lipschitz constant from Thm. 4, recalling that diam(M_+^1(X)) = 2 with respect to the Total Variation. Note that Thm. 5 assumes SINKHORNKNOPP and MINIMIZE in Alg. 2 to yield exact solutions. In Appendix D we present extensions of Alg. 2 and Thm. 5 which account for approximation errors in the above routines.\nGeneral setting. As mentioned in Sec. 4, when the β_j's are not finitely supported we adopt a sampling approach. More precisely, we propose to replace in Alg. 2 the ideal Sinkhorn potentials of the pairs (α, β_j) with those of (α, β̂_j), where each β̂_j is an empirical measure randomly sampled from β_j. In other words, we perform the FW algorithm with a (possibly rough) approximation of the correct gradient of B_ε. According to Thm. 
3, FW allows errors in the gradient estimation (which\nare captured into the precision k in the statement). To this end, the following result quanti\ufb01es the\napproximation error between r1OT\"(\u00b7, ) and r1OT\"(\u00b7, \u02c6) in terms of the sample size of \u02c6.\nTheorem 6 (Sample Complexity of Sinkhorn Potentials). Suppose that c 2C s+1(X\u21e5X ) with\ns > d/2. Then, there exists a constant r = r(X , c, d) such that for any \u21b5,  2M +\n1 (X ) and any\nempirical measure \u02c6 of a set of n points independently sampled from , we have, for every \u2327 2 (0, 1]\n(17)\n\nku  unk1 = kr1OT\"(\u21b5, )  r1OT\"(\u21b5, \u02c6)k1 \uf8ff\n\n8\" re3D/\" log 3\n\u2327\n\npn\n\nwith probability at least 1  \u2327, where u = T\u21b5(u), un = T \u02c6\u21b5(un) and u(xo) = un(xo) = 0.\n\n6\n\n\fFig. 1: Barycenter of nested ellipses\n\nFig. 2: Barycenters of Gaussians (see text)\n\nThe result in Thm. 6 is of central importance in this work. We point out that it cannot be obtained by\nmeans of the Lipschitz continuity of r1OT\" in Thm. 4, since empirical measures do not converge in\nk\u00b7kT V to their target distribution [17]. Instead, the proof consists in considering the weaker Maximum\nMean Discrepancy (MMD) metric associated to a universal kernel [46], which metrizes the topology\nof the convergence in law of M+\n1 (X ) [47]. Empirical measures converge in MMD metric to their\ntarget distribution [46]. Therefore, by proving the Lipschitz continuity of r1OT\" with respect to\nMMD (see Prop. E.5) we are able to conclude that (17) holds. This latter result relies on regularity\nproperties of Sinkhorn potentials, which have been recently shown [23, Thm.2] to be uniformly\nbounded in Sobolev spaces under the additional assumption c 2C s+1(X\u21e5X ). For suf\ufb01ciently large\ns, the Sobolev norm is in duality with the MMD [35] and allows us to derive the required Lipschitz\ncontinuity. 
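This proof route can be illustrated numerically: while empirical measures do not converge to their target in Total Variation, they do in MMD. A small self-contained sketch (the Gaussian kernel is our choice as a standard example of a universal kernel; this is not code from the paper):

```python
import numpy as np

def mmd(X, Y, sigma=1.0):
    # MMD (biased V-statistic) with the Gaussian kernel
    # k(x, y) = exp(-|x - y|^2 / (2 sigma^2)), which is universal.
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    m2 = gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
    return np.sqrt(max(m2, 0.0))
```

Two large samples from the same law have small MMD, a small sample has larger MMD, and samples from well-separated laws stay far apart, which is exactly the qualitative behavior the sample-complexity argument relies on.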
We conclude noting that while [23] studied the sample complexity of the Sinkhorn\ndivergence, Thm. 6 is a sample complexity result for Sinkhorn potentials. In this sense, we observe\nthat the constants appearing in the bound are tightly related to those in [23, Thm.3] and have similar\nbehavior with respect to \". We can now study the convergence of FW in continuous settings.\nTheorem 7. Suppose that c 2C s+1(X\u21e5X ) with s > d/2. Let n 2 N and \u02c61, . . . , \u02c6m be empirical\ndistributions with n support points, each independently sampled from 1, . . . , m. Let \u21b5k be the k-th\niterate of Alg. 2 applied to \u02c61, . . . , \u02c6m. Then for any \u2327 2 (0, 1], the following holds with probability\nlarger than 1  \u2327\n\n64\u00afr\"e3D/\" log 3m\n\u2327\n\nmin(k,pn)\n\nB\"(\u21b5k)  min\n\u21b52M+\n\n1 (X )\n\nB\"(\u21b5) \uf8ff\n\n.\n\n(18)\n\nThe proof is shown in Appendix E. A consequence of Thm. 7 is that the accuracy of FW depends\nsimultaneously on the number of iterations and the sample size used in the approximation of the\ngradients: by choosing n = k2 we recover the O(1/k) rate of the \ufb01nite setting, while for n = k we\nhave a rate of O(k1/2), which is reminiscent of typical sample complexity results, highlighting the\nstatistical nature of the problem.\nRemark 8 (Incremental Sampling). The above strategy requires sampling the empirical distributions\nfor 1, . . . , m beforehand. A natural question is whether it is be possible to do this incrementally,\nsampling new points and updating \u02c6j accordingly, as the number of FW iterations increase. To this\nend, one can perform an intersection bound and see that this strategy is still consistent, but the bound\nin Thm. 7 worsens the logarithmic term, which becomes log(3mk/\u2327 ).\n\n6 Experiments\n\nIn this section we show the performance of our method in a range of experiments. Additional\nexperiments are provided in the supplementary material. 
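As a concrete note on the implementation used throughout these experiments, the measure update (14) driving Alg. 2 reduces to simple bookkeeping on the support list and the weight vector; a minimal sketch (function and variable names are ours):

```python
def fw_update(support, weights, x_new, k):
    # alpha_{k+1} = k/(k+2) * alpha_k + 2/(k+2) * delta_{x_new}, cf. (14):
    # existing weights are rescaled by k/(k+2), the new atom gets mass 2/(k+2).
    new_weights = [w * k / (k + 2) for w in weights] + [2.0 / (k + 2)]
    return support + [x_new], new_weights
```

Starting from a single Dirac mass, after k iterations the iterate is supported on at most k + 1 points and the weights always sum to one.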
Code has been made publicly available1.\nDiscrete measures: barycenter of nested ellipses. We compute the barycenter of 30 randomly\ngenerated nested ellipses on a 50 \u21e5 50 grid similarly to [14]. We interpret each image as a probability\ndistribution in 2D. The cost matrix is given by the squared Euclidean distances between pixels. Fig. 1\nreports 8 samples of the input ellipses and the barycenter obtained with Alg. 2. It shows qualitatively\nthat our approach captures key geometric properties of the input measures.\n\n1 https://github.com/GiulsLu/Sinkhorn-Barycenters\n\n7\n\n\fFig. 3: Matching of a 140x140 image. 5000 FW iterations Fig. 4: MNIST k-means (20 centers)\n\nContinuous measures: barycenter of Gaussians. We compute the barycenter of 5 Gaussian\ndistributions N (mi, Ci) i = 1, . . . , 5 in R2, with mean mi 2 R2 and covariance Ci randomly\ngenerated. We apply Alg. 2 to empirical measures obtained by sampling n = 500 points from\neach N (mi, Ci), i = 1, . . . , 5. Since the (Wasserstein) barycenter of Gaussian distributions can be\nestimated accurately (see [2]), in Fig. 2 we report both the output of our method (as a scatter plot) and\nthe true Wasserstein barycenter (as level sets of its density). We observe that our estimator recovers\nboth the mean and covariance of the target barycenter. See the supplementary material for additional\nexperiments also in the case of mixtures of Gaussians.\nImage \u201ccompression\u201d via distribution matching. Similarly to [12], we test Alg. 2 in the special\ncase of computing the \u201cbarycenter\u201d of a single measure  2M 1\n+(X ). While the solution of\nthis problem is the distribution  itself, we can interpret the intermediate iterates \u21b5k of Alg. 2 as\ncompressed version of the original measure. In this sense k would represent the level of compression\nsince \u21b5k is supported on at most k points. Fig. 3 (Right) reports iteration k = 5000 of Alg. 2 applied\nto the 140 \u21e5 140 image in Fig. 
3 (Left) interpreted as a probability measure β in 2D. We note that the number of points in the support is ∼ 3900: indeed, Alg. 2 selects the most relevant support points multiple times to accumulate the right amount of mass on each of them (darker color = higher weight). This shows that FW tends to greedily search for the most relevant support points, prioritizing those with higher weight.\nk-means on MNIST digits. We tested our algorithm on a k-means clustering experiment. We consider a subset of 500 random images from the MNIST dataset. Each image is suitably normalized to be interpreted as a probability distribution on the grid of 28 × 28 pixels with values scaled between 0 and 1. We initialize 20 centroids according to the k-means++ strategy [4]. Fig. 4 depicts the 20 centroids obtained by performing k-means with Alg. 2. We see that the structure of the digits is successfully detected, recovering also minor details (e.g. note the difference between the 2 centroids).\nReal data: Sinkhorn propagation of weather data. We consider the problem of Sinkhorn propagation, similar to the one in [45]. The goal is to predict the distribution of missing measurements for weather stations in the state of Texas, US, by “propagating” measurements from neighboring stations in the network. The problem can be formulated as minimizing the functional Σ_{(v,u)∈E} ω_uv S_ε(ρ_v, ρ_u) over the set {ρ_v ∈ M_+^1(R^2) | v ∈ V_0}, with: V_0 ⊂ V the subset of stations with missing measurements, G = (V, E) the whole graph of the stations network, and ω_uv a weight inversely proportional to the geographical distance between two vertices/stations u, v ∈ V. The variable ρ_v ∈ M_+^1(R^2) denotes the distribution of measurements at station v of daily temperature and atmospheric pressure over one year. 
This is a generalization of the barycenter problem (9) (see also [38]).

From the total of |V| = 115 stations, we randomly select 10%, 20% or 30% to be available stations, and use Alg. 2 to propagate their measurements to the remaining "missing" ones. We compare our approach (FW) with the Dirichlet (DR) baseline in [45] in terms of the error d(C_T, Ĉ) between the covariance matrix C_T of the ground-truth distribution and that of the predicted one. Here d(A, B) = ||log(A^(-1/2) B A^(-1/2))|| is the geodesic distance on the cone of positive definite matrices. The average prediction errors are: 2.07 (FW), 2.24 (DR) for 10%; 1.47 (FW), 1.89 (DR) for 20%; and 1.3 (FW), 1.6 (DR) for 30%. Fig. 5 qualitatively reports the improvement Δ = d(C_T, C_DR) − d(C_T, C_FW) of our method on individual stations: a higher color intensity corresponds to a wider gap in our favor between prediction errors, from light green (Δ ≈ 0) to red (Δ ≈ 2). Our approach tends to propagate the distributions to missing locations with higher accuracy.

Fig. 5: From Left to Right: propagation of weather data with 10%, 20% and 30% of stations with available measurements (see text).

7 Conclusion

We proposed a Frank-Wolfe-based algorithm to find the Sinkhorn barycenter of probability distributions with either finitely or infinitely many support points. Our algorithm belongs to the family of barycenter methods with free support, since it adaptively identifies support points rather than fixing them a-priori. In the finite setting, we were able to guarantee convergence of the proposed algorithm by proving the Lipschitz continuity of the gradient of the barycenter functional in the Total Variation sense. Then, by studying the sample complexity of Sinkhorn potential estimation, we proved the convergence of our algorithm also in the infinite case.
We empirically assessed our method on a number of synthetic and real experiments, showing that it exhibits good qualitative and quantitative performance. While in this work we have considered FW iterates that are convex combinations of Dirac deltas, models with higher regularity (e.g. mixtures of Gaussians) might be better suited to approximate the barycenter of distributions with smooth densities. Hence, in the future we plan to investigate whether the perspective adopted in this work can be extended to other barycenter estimators.

References

[1] R. A. Adams and J. J. F. Fournier. Sobolev Spaces. Elsevier, 2003.

[2] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM J. Math. Analysis, 43(2):904–924, 2011.

[3] C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer Science & Business Media, 2006.

[4] David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.

[5] Francis Bach, Simon Lacoste-Julien, and Guillaume Obozinski. On the equivalence between herding and conditional gradient algorithms. arXiv preprint arXiv:1203.4523, 2012.

[6] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM J. Scientific Computing, 37(2), 2015.

[7] J. Frédéric Bonnans and Alexander Shapiro. Perturbation Analysis of Optimization Problems. Springer Science & Business Media, 2013.

[8] Nicholas Boyd, Geoffrey Schiebinger, and Benjamin Recht. The alternating descent conditional gradient method for sparse inverse problems.
SIAM Journal on Optimization, 27(2):616–639, 2017.

[9] Kristian Bredies and Hanna Katriina Pikkarainen. Inverse problems in spaces of measures. ESAIM: Control, Optimisation and Calculus of Variations, 19(1):190–218, 2013.

[10] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. Scaling algorithms for unbalanced optimal transport problems. Mathematics of Computation, 87(314):2563–2609, 2018.

[11] G. Choquet. Lectures on Analysis, Vol. II. W. A. Benjamin, Inc., Reading, MA, USA, 1969.

[12] S. Claici, E. Chien, and J. Solomon. Stochastic Wasserstein barycenters. ArXiv e-prints, February 2018.

[13] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[14] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 685–693, Beijing, China, 22–24 Jun 2014. PMLR.

[15] V. F. Demyanov and A. M. Rubinov. The minimization of a smooth convex functional on a convex set. J. SIAM Control, 5(2):280–294, 1967.

[16] V. F. Demyanov and A. M. Rubinov. Minimization of functionals in normed spaces. J. SIAM Control, 6(1):73–88, 1968.

[17] Luc Devroye, Laszlo Gyorfi, et al. No empirical probability measure can converge in the total variation sense for all distributions. The Annals of Statistics, 18(3):1496–1499, 1990.

[18] Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jerret Ross, Cicero Dos Santos, and Tom Sercu. Wasserstein barycenter model ensembling. arXiv preprint arXiv:1902.04999, 2019.

[19] Joseph C. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules.
Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.

[20] Pavel Dvurechenskii, Darina Dvinskikh, Alexander Gasnikov, Cesar Uribe, and Angelia Nedich. Decentralize and randomize: Faster algorithm for Wasserstein barycenters. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10760–10770. Curran Associates, Inc., 2018.

[21] Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-Ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[22] J. Franklin and J. Lorenz. On the scaling of multidimensional matrices. Linear Algebra and its Applications, 114:717–735, 1989.

[23] Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample complexity of Sinkhorn divergences. International Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

[24] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.

[25] Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–1617, 2018.

[26] Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In International Conference on Information Processing in Medical Imaging, pages 261–272. Springer, 2015.

[27] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization.
In ICML (1), pages 427–435, 2013.

[28] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.

[29] L. Kantorovich. On the transfer of masses (in Russian). Doklady Akademii Nauk USSR, 1942.

[30] Paul Knopp and Richard Sinkhorn. A note concerning simultaneous integral equations. Canadian Journal of Mathematics, 20:855–861, 1968.

[31] Simon Lacoste-Julien, Fredrik Lindsten, and Francis Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. arXiv preprint arXiv:1501.02056, 2015.

[32] Bas Lemmens and Roger Nussbaum. Nonlinear Perron-Frobenius Theory, volume 189. Cambridge University Press, 2012.

[33] Bas Lemmens and Roger Nussbaum. Birkhoff's version of Hilbert's metric and its applications in analysis. arXiv preprint arXiv:1304.7921, 2013.

[34] M. V. Menon. Reduction of a matrix with positive elements to a doubly stochastic matrix. Proc. Amer. Math. Soc., 18:244–247, 1967.

[35] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.

[36] Roger Nussbaum. Hilbert's projective metric and iterated nonlinear maps. Mem. Amer. Math. Soc., 391:1–137, 1988.

[37] Roger Nussbaum. Entropy minimization, Hilbert's projective metric and scaling integral kernels. Journal of Functional Analysis, 115:45–99, 1993.

[38] Gabriel Peyré and Marco Cuturi. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

[39] Iosif Pinelis. Optimum bounds for the distributions of martingales in Banach spaces.
The Annals of Probability, pages 1679–1706, 1994.

[40] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.

[41] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific J. Math., 21(2):343–348, 1967.

[42] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153–172, 2007.

[43] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13–31. Springer, 2007.

[44] Justin Solomon, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.

[45] Justin Solomon, Raif M. Rustamov, Leonidas Guibas, and Adrian Butscher. Wasserstein propagation for semi-supervised learning. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages I-306–I-314. JMLR.org, 2014.

[46] Le Song. Learning via Hilbert Space Embedding of Distributions. PhD thesis, 2008.

[47] Bharath K. Sriperumbudur, Kenji Fukumizu, and Gert R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12(Jul):2389–2410, 2011.

[48] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures.
Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.

[49] Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346, 2018.

[50] Matthew Staib, Sebastian Claici, Justin M. Solomon, and Stefanie Jegelka. Parallel streaming Wasserstein barycenters. In Advances in Neural Information Processing Systems, pages 2647–2658, 2017.

[51] Elias M. Stein. Singular Integrals and Differentiability Properties of Functions (PMS-30), volume 30. Princeton University Press, 2016.

[52] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008.

[53] Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

[54] J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, May 2017.

[55] V. V. Yurinskii. Exponential inequalities for sums of random vectors. Journal of Multivariate Analysis, 6(4):473–499, 1976.