{"title": "Parallel Streaming Wasserstein Barycenters", "book": "Advances in Neural Information Processing Systems", "page_first": 2647, "page_last": 2658, "abstract": "Efficiently aggregating data from different sources is a challenging problem, particularly when samples from each source are distributed differently. These differences can be inherent to the inference task or present for other reasons: sensors in a sensor network may be placed far apart, affecting their individual measurements. Conversely, it is computationally advantageous to split Bayesian inference tasks across subsets of data, but data need not be identically distributed across subsets. One principled way to fuse probability distributions is via the lens of optimal transport: the Wasserstein barycenter is a single distribution that summarizes a collection of input measures while respecting their geometry. However, computing the barycenter scales poorly and requires discretization of all input distributions and the barycenter itself. Improving on this situation, we present a scalable, communication-efficient, parallel algorithm for computing the Wasserstein barycenter of arbitrary distributions. Our algorithm can operate directly on continuous input distributions and is optimized for streaming data. Our method is even robust to nonstationary input distributions and produces a barycenter estimate that tracks the input measures over time. The algorithm is semi-discrete, needing to discretize only the barycenter estimate. To the best of our knowledge, we also provide the first bounds on the quality of the approximate barycenter as the discretization becomes finer. 
Finally, we demonstrate the practical effectiveness of our method, both in tracking moving distributions on a sphere, as well as in a large-scale Bayesian inference task.", "full_text": "Parallel Streaming Wasserstein Barycenters\n\nMatthew Staib\nMIT CSAIL\n\nmstaib@mit.edu\n\nSebastian Claici\n\nMIT CSAIL\n\nsclaici@mit.edu\n\nJustin Solomon\n\nMIT CSAIL\n\njsolomon@mit.edu\n\nStefanie Jegelka\n\nMIT CSAIL\n\nstefje@mit.edu\n\nAbstract\n\nEf\ufb01ciently aggregating data from different sources is a challenging problem, partic-\nularly when samples from each source are distributed differently. These differences\ncan be inherent to the inference task or present for other reasons: sensors in a sensor\nnetwork may be placed far apart, affecting their individual measurements. Con-\nversely, it is computationally advantageous to split Bayesian inference tasks across\nsubsets of data, but data need not be identically distributed across subsets. One prin-\ncipled way to fuse probability distributions is via the lens of optimal transport: the\nWasserstein barycenter is a single distribution that summarizes a collection of input\nmeasures while respecting their geometry. However, computing the barycenter\nscales poorly and requires discretization of all input distributions and the barycenter\nitself. Improving on this situation, we present a scalable, communication-ef\ufb01cient,\nparallel algorithm for computing the Wasserstein barycenter of arbitrary distribu-\ntions. Our algorithm can operate directly on continuous input distributions and is\noptimized for streaming data. Our method is even robust to nonstationary input\ndistributions and produces a barycenter estimate that tracks the input measures\nover time. The algorithm is semi-discrete, needing to discretize only the barycenter\nestimate. To the best of our knowledge, we also provide the \ufb01rst bounds on the\nquality of the approximate barycenter as the discretization becomes \ufb01ner. 
Finally, we demonstrate the practical effectiveness of our method, both in tracking moving distributions on a sphere, as well as in a large-scale Bayesian inference task.

1 Introduction

A key challenge when scaling up data aggregation occurs when data comes from multiple sources, each with its own inherent structure. Sensors in a sensor network may be configured differently or placed far apart, but each individual sensor simply measures a different view of the same quantity. Similarly, user data collected by a server in California will differ from that collected by a server in Europe: the data samples may be independent but are not identically distributed.
One reasonable approach to aggregation in the presence of multiple data sources is to perform inference on each piece independently and fuse the results. This is possible when the data can be distributed randomly, using methods akin to distributed optimization [52, 53]. However, when the data is not split in an i.i.d. way, Bayesian inference on different subsets of observed data yields slightly different "subset posterior" distributions for each subset that must be combined [33]. Further complicating matters, data sources may be nonstationary. How can we fuse these different data sources for joint analysis in a consistent and structure-preserving manner?
We address this question using ideas from the theory of optimal transport. Optimal transport gives us a principled way to measure distances between measures that takes into account the underlying space on which the measures are defined. Intuitively, the optimal transport distance between two distributions measures the amount of work one would have to do to move all mass from one distribution to the other. Given J input measures {µ_j}_{j=1}^J, it is natural, in this setting, to ask for a measure ν that minimizes the total squared distance to the input measures. 
This measure ν is called the Wasserstein barycenter of the input measures [1], and should be thought of as an aggregation of the input measures which preserves their geometry. This particular aggregation enjoys many nice properties: in the earlier Bayesian inference example, aggregating subset posterior distributions via their Wasserstein barycenter yields guarantees on the original inference task [47].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

If the measures µ_j are discrete, their barycenter can be computed relatively efficiently via either a sparse linear program [2], or regularized projection-based methods [16, 7, 51, 17]. However, 1. these techniques scale poorly with the support of the measures, and quickly become impractical as the support becomes large. 2. When the input measures are continuous, to the best of our knowledge the only option is to discretize them via sampling, but the rate of convergence to the true (continuous) barycenter is not well-understood. These two confounding factors make it difficult to utilize barycenters in scenarios like parallel Bayesian inference where the measures are continuous and a fine approximation is needed. These are the primary issues we work to address in this paper.
Given sample access to J potentially continuous distributions µ_j, we propose a communication-efficient, parallel algorithm to estimate their barycenter. Our method can be parallelized to J worker machines, and the messages sent between machines are merely single integers. We require a discrete approximation only of the barycenter itself, making our algorithm semi-discrete, and our algorithm scales well to fine approximations (e.g. n ≈ 10^6). In contrast to previous work, we provide guarantees on the quality of the approximation as n increases. 
These rates apply to the general setting in which the µ_j's are defined on manifolds, with applications to directional statistics [46]. Our algorithm is based on stochastic gradient descent as in [22] and hence is robust to gradual changes in the distributions: as the µ_j's change over time, we maintain a moving estimate of their barycenter, a task which is not possible using current methods without solving a large linear program in each iteration.
We emphasize that we aggregate the input distributions into a summary, the barycenter, which is itself a distribution. Instead of performing any single domain-specific task such as clustering or estimating an expectation, we can simply compute the barycenter of the inputs and process it later in any arbitrary way. This generality coupled with the efficiency and parallelism of our algorithm yields immediate applications in fields from large-scale Bayesian inference to e.g. streaming sensor fusion.

Contributions. 1. We give a communication-efficient and fully parallel algorithm for computing the barycenter of a collection of distributions. Although our algorithm is semi-discrete, we stress that the input measures can be continuous, and even nonstationary. 2. We give bounds on the quality of the recovered barycenter as our discretization becomes finer. These are the first such bounds we are aware of, and they apply to measures on arbitrary compact and connected manifolds. 3. We demonstrate the practical effectiveness of our method, both in tracking moving distributions on a sphere, as well as in a real large-scale Bayesian inference task.

1.1 Related work

Optimal transport. A comprehensive treatment of optimal transport and its many applications is beyond the scope of our work. We refer the interested reader to the detailed monographs by Villani [49] and Santambrogio [42]. 
Fast algorithms for optimal transport have been developed in recent years via Sinkhorn's algorithm [15] and in particular stochastic gradient methods [22], on which we build in this work. These algorithms have enabled several applications of optimal transport and Wasserstein metrics to machine learning, for example in supervised learning [21], unsupervised learning [34, 5], and domain adaptation [14]. Wasserstein barycenters in particular have been applied to a wide variety of problems including fusion of subset posteriors [47], distribution clustering [51], shape and texture interpolation [45, 40], and multi-target tracking [6].
When the distributions µ_j are discrete, transport barycenters can be computed relatively efficiently via either a sparse linear program [2] or regularized projection-based methods [16, 7, 51, 17]. In settings like posterior inference, however, the distributions µ_j are likely continuous rather than discrete, and the most obvious viable approach requires discrete approximation of each µ_j. The resulting discrete barycenter converges to the true, continuous barycenter as the approximations become finer [10, 28], but the rate of convergence is not well-understood, and finely approximating each µ_j yields a very large linear program.

Scalable Bayesian inference. Scaling Bayesian inference to large datasets has become an important topic in recent years. There are many approaches to this, ranging from parallel Gibbs sampling [38, 26] to stochastic and streaming algorithms [50, 13, 25, 12]. For a more complete picture, we refer the reader to the survey by Angelino et al. [3].
One promising method is via subset posteriors: instead of sampling from the posterior distribution given by the full data, the data is split into smaller tractable subsets. 
Performing inference on each subset yields several subset posteriors, which are biased but can be combined via their Wasserstein barycenter [47], with provable guarantees on approximation quality. This is in contrast to other methods that rely on summary statistics to estimate the true posterior [33, 36] and that require additional assumptions. In fact, our algorithm works with arbitrary measures and on manifolds.

2 Background

Let (X, d) be a metric space. Given two probability measures µ ∈ P(X) and ν ∈ P(X) and a cost function c : X × X → [0, ∞), the Kantorovich optimal transport problem asks for a solution to

    inf { ∫_{X×X} c(x, y) dγ(x, y) : γ ∈ Π(µ, ν) },    (1)

where Π(µ, ν) is the set of measures γ on the product space X × X whose marginals evaluate to µ and ν, respectively.
Under mild conditions on the cost function (lower semi-continuity) and the underlying space (completeness and separability), problem (1) admits a solution [42]. Moreover, if the cost function is of the form c(x, y) = d(x, y)^p, the optimal transportation cost is a distance metric on the space of probability measures. This is known as the Wasserstein distance and is given by

    W_p(µ, ν) = ( inf_{γ ∈ Π(µ,ν)} ∫_{X×X} d(x, y)^p dγ(x, y) )^{1/p}.    (2)

Optimal transport has recently attracted much attention in machine learning and adjacent communities [21, 34, 14, 39, 41, 5]. When µ and ν are discrete measures, problem (2) is a linear program, although faster regularized methods based on Sinkhorn iteration are used in practice [15]. Optimal transport can also be computed using stochastic first-order methods [22].
Now let µ_1, . . . , µ_J be measures on X. 
The Wasserstein barycenter problem, introduced by Agueh and Carlier [1], is to find a measure ν ∈ P(X) that minimizes the functional

    F[ν] := (1/J) Σ_{j=1}^J W_2^2(µ_j, ν).    (3)

Finding the barycenter ν is the primary problem we address in this paper. When each µ_j is a discrete measure, the exact barycenter can be found via linear programming [2], and many of the regularization techniques apply for approximating it [16, 17]. However, the problem size grows quickly with the size of the support. When the measures µ_j are truly continuous, we are aware of only one strategy: sample from each µ_j in order to approximate it by the empirical measure, and then solve the discrete barycenter problem.
We directly address the problem of computing the barycenter when the input measures can be continuous. We solve a semi-discrete problem, where the target measure is a finite set of points, but we do not discretize any other distribution.

3 Algorithm

We first provide some background on the dual formulation of optimal transport. Then we derive a useful form of the barycenter problem, provide an algorithm to solve it, and prove convergence guarantees. Finally, we demonstrate how our algorithm can easily be parallelized.

3.1 Mathematical preliminaries

The primal optimal transport problem (1) admits a dual problem [42]:

    OT_c(µ, ν) = sup_{v 1-Lipschitz} { E_{Y∼ν}[v(Y)] + E_{X∼µ}[v^c(X)] },    (4)

where v^c(x) = inf_{y∈X} {c(x, y) − v(y)} is the c-transform of v [49]. When ν = Σ_{i=1}^n w_i δ_{y_i} is discrete, problem (4) becomes the semi-discrete problem

    OT_c(µ, ν) = max_{v∈R^n} { ⟨w, v⟩ + E_{X∼µ}[h(X, v)] },    (5)

where we define h(x, v) = v^c(x) = min_{i=1,...,n} {c(x, y_i) − v_i}. Semi-discrete optimal transport admits efficient algorithms [31, 29]; Genevay et al. [22] in particular observed that given sample oracle access to µ, the semi-discrete problem can be solved via stochastic gradient ascent. 
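This observation can be made concrete: given only a sampler for µ, the semi-discrete dual (5) is maximized by stochastic gradient ascent on v. The following Python sketch is our own illustration (not the paper's implementation), assuming the squared Euclidean cost c(x, y) = ‖x − y‖² and a step size decaying as 1/√t:

```python
import numpy as np

def semidiscrete_ot_dual(sample_mu, support_y, weights_w, n_iters=20000, step=0.05, seed=0):
    """Stochastic gradient ascent on the semi-discrete dual (5):
    max_v <w, v> + E_{X~mu}[ min_i c(X, y_i) - v_i ], with c(x, y) = ||x - y||^2.
    Returns the dual variables v and a running estimate of OT_c(mu, nu)."""
    rng = np.random.default_rng(seed)
    n = len(weights_w)
    v = np.zeros(n)
    running = 0.0
    for t in range(1, n_iters + 1):
        x = sample_mu(rng)
        costs = np.sum((support_y - x) ** 2, axis=1) - v
        i_star = int(np.argmin(costs))
        # unbiased supergradient of the concave objective at sample x: w - e_{i*}
        grad = weights_w.copy()
        grad[i_star] -= 1.0
        v += (step / np.sqrt(t)) * grad
        # running average of the stochastic dual objective <w, v> + h(x, v)
        running += (weights_w @ v + costs[i_star] - running) / t
    return v, running
```

The supergradient at a sample x is w − e_{i*}, exactly the sparse update the paper exploits; the running average of ⟨w, v⟩ + h(x, v) estimates the dual objective and hence OT_c(µ, ν).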
Hence optimal transport distances can be estimated even in the semi-discrete setting.

3.2 Deriving the optimization problem

Absolutely continuous measures can be approximated arbitrarily well by discrete distributions with respect to Wasserstein distance [30]. Hence one natural approach to the barycenter problem (3) is to approximate the true barycenter via discrete approximation: we fix n support points {y_i}_{i=1}^n ∈ X and search over assignments of the mass w_i on each point y_i. In this way we wish to find the discrete distribution ν_n = Σ_{i=1}^n w_i δ_{y_i} with support on those n points which optimizes

    min_{w∈Δ_n} F(w) = min_{w∈Δ_n} (1/J) Σ_{j=1}^J W_2^2(µ_j, ν_n),    (6)

where we have defined F(w) := F[ν_n] = F[Σ_{i=1}^n w_i δ_{y_i}] and used the dual formulation from equation (5). In Section 4, we discuss the effect of different choices for the support points {y_i}_{i=1}^n.
Noting that the variables v_j are uncoupled, we can rearrange to get the following problem:

    min_{w∈Δ_n} (1/J) Σ_{j=1}^J max_{v_j∈R^n} { ⟨w, v_j⟩ + E_{X_j∼µ_j}[h(X_j, v_j)] }    (7)

    = min_{w∈Δ_n} max_{v_1,...,v_J} (1/J) Σ_{j=1}^J [ ⟨w, v_j⟩ + E_{X_j∼µ_j}[h(X_j, v_j)] ].    (8)

Problem (8) is convex in w and jointly concave in the v_j, and we can compute an unbiased gradient estimate for each by sampling X_j ∼ µ_j. Hence, we could solve this saddle-point problem via simultaneous (sub)gradient steps as in Nemirovski and Rubinstein [37]. Such methods are simple to implement, but in the current form we must project onto the simplex Δ_n at each iteration. This requires only O(n log n) time [24, 32, 19] but makes it hard to decouple the problem across each distribution µ_j. 
Fortunately, we can reformulate the problem in a way that avoids projection entirely. By strong duality, Problem (8) can be written as

    min_{w∈Δ_n} max_{v_1,...,v_J} { ⟨(1/J) Σ_{j=1}^J v_j, w⟩ + (1/J) Σ_{j=1}^J E_{X_j∼µ_j}[h(X_j, v_j)] }    (9)

    = max_{v_1,...,v_J} { min_i [(1/J) Σ_{j=1}^J v_j]_i + (1/J) Σ_{j=1}^J E_{X_j∼µ_j}[h(X_j, v_j)] }.    (10)

Note how the variable w disappears: for any fixed vector b, minimization of ⟨b, w⟩ over w ∈ Δ_n is equivalent to finding the minimum element of b. The optimal w can also be computed in closed form when the barycentric cost is entropically regularized as in [9], which may yield better convergence rates but requires dense updates that, e.g., need more communication in the parallel setting. In either case, we are left with a concave maximization problem in v_1, . . . , v_J, to which we can directly apply stochastic gradient ascent. Unfortunately the gradients are still not sparse and decoupled. We obtain sparsity after one final transformation of the problem: by replacing each Σ_{j=1}^J v_{j,i} with a variable s_i and enforcing this equality with a constraint, we turn problem (10) into the constrained problem

    max_{s,v_1,...,v_J} { min_i (1/J) s_i + (1/J) Σ_{j=1}^J E_{X_j∼µ_j}[h(X_j, v_j)] }   s.t.   s = Σ_{j=1}^J v_j.    (11)

3.3 Algorithm and convergence

We can now solve this problem via stochastic projected subgradient ascent. This is described in Algorithm 1; note that the sparse adjustments after the gradient step are actually projections onto the constraint set with respect to the ℓ1 norm. Derivation of this sparse projection step is given rigorously in Appendix A. Not only do we have an optimization algorithm with sparse updates, but we can even recover the optimal weights w from standard results in online learning [20]. 
Algorithm 1 Subgradient Ascent
    s, v_1, . . . , v_J ← 0_n
    loop
        Draw j ∼ Unif[1, . . . , J]
        Draw x ∼ µ_j
        iW ← argmin_i {c(x, y_i) − v_{j,i}}
        iM ← argmin_i s_i
        v_{j,iW} ← v_{j,iW} − γ        ▷ Gradient update
        s_{iM} ← s_{iM} + γ/J          ▷ Gradient update
        v_{j,iW} ← v_{j,iW} + γ/2      ▷ Projection
        v_{j,iM} ← v_{j,iM} + γ/(2J)   ▷ Projection
        s_{iW} ← s_{iW} − γ/2          ▷ Projection
        s_{iM} ← s_{iM} − γ/(2J)       ▷ Projection
    end loop

Specifically, in a zero-sum game where one player plays a no-regret learning algorithm and the other plays a best-response strategy, the average strategies of both players converge to optimal:

Theorem 3.1. Perform T iterations of stochastic subgradient ascent on u = (s, v_1, . . . , v_J) as in Algorithm 1, and use step size γ = R/(4√T), assuming ‖u_t − u*‖_1 ≤ R for all t. Let i_t be the minimizing index chosen at iteration t, and write w̄_T = (1/T) Σ_{t=1}^T e_{i_t}. Then we can bound

    E[F(w̄_T) − F(w*)] ≤ 4R/√T.    (12)

The expectation is with respect to the randomness in the subgradient estimates g_t.
Theorem 3.1 is proved in Appendix B. The proof combines the zero-sum game idea above, which itself comes from [20], with a regret bound for online gradient descent [54, 23].

3.4 Parallel Implementation

The key realization which makes our barycenter algorithm truly scalable is that the variables s, v_1, . . . , v_J can be separated across different machines. In particular, the "sum" or "coupling" variable s is maintained on a master thread which runs Algorithm 2, and each v_j is maintained on a worker thread running Algorithm 3. Each projected gradient step requires first selecting distribution j. The algorithm then requires computing only iW = argmin_i {c(x_j, y_i) − v_{j,i}} and iM = argmin_i s_i, and then updating s and v_j in only those coordinates. 
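For concreteness, the updates of Algorithm 1 can be sketched serially in Python (our own simplified rendering, not the paper's C++/MPI implementation; the squared Euclidean cost and a constant step size are assumptions). Following Theorem 3.1, the weight estimate is the average of the indicators e_{iM}, i.e. the normalized counts of the minimizing indices:

```python
import numpy as np

def barycenter_weights(samplers, support_y, n_iters=60000, step=0.1, seed=0):
    """Serial sketch of Algorithm 1: sparse projected subgradient ascent on
    problem (11). samplers[j](rng) draws one sample from mu_j; the cost is
    c(x, y) = ||x - y||^2. Returns barycenter weights w on support_y."""
    rng = np.random.default_rng(seed)
    J, n = len(samplers), len(support_y)
    s = np.zeros(n)        # coupling variable, tracks sum_j v_j
    v = np.zeros((J, n))   # one dual vector per input distribution
    counts = np.zeros(n)   # how often each index was the minimizer iM
    for _ in range(n_iters):
        j = int(rng.integers(J))
        x = samplers[j](rng)
        iW = int(np.argmin(np.sum((support_y - x) ** 2, axis=1) - v[j]))
        iM = int(np.argmin(s))
        counts[iM] += 1
        # net effect of the gradient step plus the sparse l1 projection
        # onto the constraint set {s = sum_j v_j}
        v[j, iW] -= step / 2          # -step (gradient) + step/2 (projection)
        v[j, iM] += step / (2 * J)    # projection
        s[iM] += step / (2 * J)       # +step/J (gradient) - step/(2J) (projection)
        s[iW] -= step / 2             # projection
    return counts / counts.sum()
```

On a toy problem with two point masses at 0 and 2 on the line and candidate support {0, 1, 2}, the weights concentrate on the midpoint, the true W_2^2 barycenter.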
Hence only a small amount of information (iW and iM) needs to pass between threads.
Note also that this algorithm can be adapted to the parallel shared-memory case, where s is a variable shared between threads which make sparse updates to it. Here we will focus on the first master/worker scenario for simplicity.
Where are the bottlenecks? When there are n points in the discrete approximation, each worker's task of computing argmin_i {c(x_j, y_i) − v_{j,i}} requires O(n) computations of c(x, y). The master must iteratively find the minimum element s_{iM} in the vector s, then update s_{iM}, and decrease element s_{iW}. These can be implemented respectively as the "find min", "delete min" then "insert," and "decrease min" operations in a Fibonacci heap. All these operations together take amortized O(log n) time. Hence, it takes O(n) time for all J workers to each produce one gradient sample in parallel, and only O(J log n) time for the master to process them all. Of course, communication is not free, but the messages are small and our approach should scale well for J ≪ n.
This parallel algorithm is particularly well-suited to the Wasserstein posterior (WASP) [48] framework for merging Bayesian subset posteriors. In this setting, we split the dataset X_1, . . . , X_k into J subsets S_1, . . . , S_J each with k/J data points, distribute those subsets to J different machines, then each machine runs Markov Chain Monte Carlo (MCMC) to sample from p(θ|S_i), and we aggregate these posteriors via their barycenter. The most expensive subroutine in the worker thread is actually sampling from the posterior, and everything else is cheap in comparison. 
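The master's bookkeeping described above (find the minimum of s, bump it, decrease another entry) can also be sketched with a lazy binary heap standing in for the Fibonacci heap; `MasterState` is a hypothetical name for illustration, and stale heap entries are simply discarded when they surface during find-min:

```python
import heapq

class MasterState:
    """Sketch of the master's O(log n) bookkeeping for the vector s,
    using a lazy binary heap instead of the paper's Fibonacci heap."""

    def __init__(self, n):
        self.s = [0.0] * n
        self.heap = [(0.0, i) for i in range(n)]
        heapq.heapify(self.heap)

    def find_min(self):
        # pop entries whose stored value no longer matches s[i] (stale)
        while self.heap[0][0] != self.s[self.heap[0][1]]:
            heapq.heappop(self.heap)
        return self.heap[0][1]

    def update(self, i, delta):
        # "decrease/increase key" via re-insertion; old entry becomes stale
        self.s[i] += delta
        heapq.heappush(self.heap, (self.s[i], i))

    def step(self, iW, step_size, J):
        # one master iteration: receive iW, reply with iM, adjust s sparsely
        iM = self.find_min()
        self.update(iM, step_size / (2 * J))
        self.update(iW, -step_size / 2)
        return iM
```

Every current value of s is always present in the heap, so the first non-stale entry at the top is a true minimizer; this keeps each master iteration logarithmic in n, matching the O(J log n) accounting above up to constants.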
In particular, the machines need not even share samples from their respective MCMC chains.
One subtlety is that selecting worker j truly uniformly at random each iteration requires more synchronization, hence our gradient estimates are not actually independent as usual. Selecting worker threads as they are available will fail to yield a uniform distribution over j, as at the moment worker j finishes one gradient step, the probability that worker j is the next available is much less than 1/J: worker j must resample and recompute iW, whereas other threads would have a head start. If workers all took precisely the same amount of time, the ordering of worker threads would be deterministic, and guarantees for without-replacement sampling variants of stochastic gradient ascent would apply [44]. In practice, we have no issues with our approach.

Algorithm 2 Master Thread
    Input: atoms {y_i}_{i=1,...,n}, number J of distributions, step size γ
    Output: barycenter weights w
    c ← 0_n; s ← 0_n; iM ← 1
    loop
        iW ← message from worker j
        Send iM to worker j
        c_{iM} ← c_{iM} + 1
        s_{iM} ← s_{iM} + γ/(2J)
        s_{iW} ← s_{iW} − γ/2
        iM ← argmin_i s_i
    end loop
    return w ← c / (Σ_{i=1}^n c_i)

4 Consistency

Prior methods for estimating the Wasserstein barycenter ν* of continuous measures µ_j ∈ P(X) involve first approximating each µ_j by a measure µ_{j,n} that has finite support on n points, then computing the barycenter ν*_n of {µ_{j,n}} as a surrogate for ν*. This approach is consistent, in that if µ_{j,n} → µ_j as n → ∞, then also ν*_n → ν*. This holds even if the barycenter is not unique, both in the Euclidean case [10, Theorem 3.1] as well as when X is a Riemannian manifold [28, Theorem 5.4]. 
However, it is not known how fast the approximation ν*_n approaches the true barycenter ν*, or even how fast the barycentric distance F[ν*_n] approaches F[ν*].
In practice, not even the approximation ν*_n is computed exactly: instead, support points are chosen and ν*_n is constrained to have support on those points. There are various heuristic methods for choosing these support points, ranging from mesh grids of the support, to randomly sampling points from the convex hull of the supports of µ_j, or even optimizing over the support point locations. Yet we are unaware of any rigorous guarantees on the quality of these approximations.
While our approach still involves approximating the barycenter ν* by a measure ν*_n with fixed support, we are able to provide bounds on the quality of this approximation as n → ∞. Specifically, we bound the rate at which F[ν*_n] → F[ν*]. The result is intuitive, and appeals to the notion of an ε-cover of the support of the barycenter:

Definition 4.1 (Covering Number). The ε-covering number of a compact set K ⊂ X, with respect to the metric g, is the minimum number N_ε(K) of points {x_i}_{i=1}^{N_ε(K)} ∈ K needed so that for each y ∈ K, there is some x_i with g(x_i, y) ≤ ε. The set {x_i} is called an ε-covering.

Definition 4.2 (Inverse Covering Radius). Fix n ∈ Z_+. We define the n-inverse covering radius of compact K ⊂ X as the value ε_n(K) = inf{ε > 0 : N_ε(K) ≤ n}, when n is large enough so the infimum exists.

Suppose throughout this section that K ⊂ R^d is endowed with a Riemannian metric g, where K has diameter D. In the specific case where g is the usual Euclidean metric, there is an ε-cover for K with at most C_1 ε^{−d} points, where C_1 depends only on the diameter D and dimension d [43]. 
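To make Definition 4.1 concrete in the Euclidean case just mentioned, a regular grid forms an ε-cover of the unit cube [0,1]^d with O(ε^{−d}) points. The sketch below is our own illustration (`grid_cover` is not from the paper); it chooses the grid spacing so that each cell's half-diagonal, the farthest any point can be from a grid point, is at most ε:

```python
import numpy as np

def grid_cover(eps, d):
    """Return grid points forming an eps-cover of [0,1]^d under the
    Euclidean metric: every point of the cube is within eps of the grid.
    The number of points is O(eps^{-d}), as in Definition 4.1."""
    # spacing h so that the half-diagonal of a grid cell, h*sqrt(d)/2,
    # is at most eps
    h = 2 * eps / np.sqrt(d)
    k = int(np.ceil(1.0 / h)) + 1          # points per axis
    axis = np.linspace(0.0, 1.0, k)
    pts = np.stack(np.meshgrid(*([axis] * d), indexing="ij"), axis=-1)
    return pts.reshape(-1, d)
```

For instance, with d = 1 and ε = 0.3 the cover is the three points {0, 0.5, 1}, and every point of [0,1] is within 0.25 ≤ ε of one of them.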
Reversing the inequality, K has an n-inverse covering radius of at most ε_n(K) ≤ C_2 n^{−1/d} when n takes the correct form.
We now present and then prove our main result:

Theorem 4.1. Suppose the measures µ_j are supported on K, and suppose µ_1 is absolutely continuous with respect to volume. Then the barycenter ν* is unique. Moreover, for each empirical approximation size n, if we choose support points {y_i}_{i=1,...,n} that constitute a 2ε_n(K)-cover of K, it follows that F[ν*_n] − F[ν*] ≤ O(ε_n(K) + n^{−1/d}), where ν*_n = Σ_{i=1}^n w*_i δ_{y_i} for w* solving Problem (8).

Remark 4.1. Absolute continuity is only needed to reason about approximating the barycenter with an N point discrete distribution. If the input distributions are themselves discrete distributions, so is the barycenter, and we can strengthen our result. For large enough n, we actually have W_2(ν*_n, ν*) ≤ 2ε_n(K) and therefore F[ν*_n] − F[ν*] ≤ O(ε_n(K)).

Algorithm 3 Worker Thread
    Input: index j, distribution µ, atoms {y_i}_{i=1,...,n}, number J of distributions, step size γ
    v ← 0_n
    loop
        Draw x ∼ µ
        iW ← argmin_i {c(x, y_i) − v_i}
        Send iW to master
        iM ← message from master
        v_{iM} ← v_{iM} + γ/(2J)
        v_{iW} ← v_{iW} − γ/2
    end loop

Corollary 4.1 (Convergence to ν*). Suppose the measures µ_j are supported on K, with µ_1 absolutely continuous with respect to volume. Let ν* be the unique minimizer of F. Then we can choose support points {y_i}_{i=1,...,n} such that some subsequence of ν*_n = Σ_{i=1}^n w*_i δ_{y_i} converges weakly to ν*.
Proof. By Theorem 4.1, we can choose support points so that F[ν*_n] → F[ν*]. By compactness, the sequence ν*_n admits a convergent subsequence ν*_{n_k} → ν for some measure ν. 
Continuity of F allows us to pass to the limit lim_{k→∞} F[ν*_{n_k}] = F[lim_{k→∞} ν*_{n_k}]. On the other hand, lim_{k→∞} F[ν*_{n_k}] = F[ν*], and F is strictly convex [28], thus ν*_{n_k} → ν* weakly.
Before proving Theorem 4.1, we need smoothness of the barycenter functional F with respect to Wasserstein-2 distance:

Lemma 4.1. Suppose we are given measures {µ_j}_{j=1}^J, ν, and {ν_n}_{n=1}^∞ supported on K, with ν_n → ν. Then F[ν_n] → F[ν], with |F[ν_n] − F[ν]| ≤ 2D · W_2(ν_n, ν).

Proof of Theorem 4.1. Uniqueness of ν* follows from Theorem 2.4 of [28]. From Theorem 5.1 in [28] we know further that ν* is absolutely continuous with respect to volume.
Let N > 0, and let ν_N be the discrete distribution on N points, each with mass 1/N, which minimizes W_2(ν_N, ν*). This distribution satisfies W_2(ν_N, ν*) ≤ C N^{−1/d} [30], where C depends on K, the dimension d, and the metric. With our "budget" of n support points, we can construct a 2ε_n(K)-cover as long as n is sufficiently large. Then define a distribution ν_{n,N} with support on the 2ε_n(K)-cover as follows: for each x in the support of ν_N, map x to the closest point x' in the cover, and add mass 1/N to x'. Note that this defines not only the distribution ν_{n,N}, but also a transport plan between ν_N and ν_{n,N}. This map moves N points of mass 1/N each a distance at most 2ε_n(K), so we may bound W_2(ν_{n,N}, ν_N) ≤ √(N · (1/N) · (2ε_n(K))²) = 2ε_n(K). Combining these two bounds, we see that

    W_2(ν_{n,N}, ν*) ≤ W_2(ν_{n,N}, ν_N) + W_2(ν_N, ν*)    (13)
                     ≤ 2ε_n(K) + C N^{−1/d}.    (14)

For each n, we choose to set N = n, which yields W_2(ν_{n,n}, ν*) ≤ 2ε_n(K) + C n^{−1/d}. 
Applying Lemma 4.1, and recalling that ν* is the minimizer of F, we have

    F[ν_{n,n}] − F[ν*] ≤ 2D · (2ε_n(K) + C n^{−1/d}) = O(ε_n(K) + n^{−1/d}).    (15)

However, we must have F[ν*_n] ≤ F[ν_{n,n}], because both are measures on the same n point 2ε_n(K)-cover, but ν*_n has weights chosen to minimize F. Thus we must also have

    F[ν*_n] − F[ν*] ≤ F[ν_{n,n}] − F[ν*] ≤ O(ε_n(K) + n^{−1/d}).

The high-level view of the above result is that choosing support points y_i to form an ε-cover with respect to the metric g, and then optimizing over their weights w_i via our stochastic algorithm, will give us a consistent picture of the behavior of the true barycenter. Also note that the proof above requires an ε-cover only of the support of ν*, not all of K. In particular, an ε-cover of the convex hull of the supports of µ_j is sufficient, as this must contain the barycenter. Other heuristic techniques to efficiently focus a limited budget of n points only on the support of ν* are advantageous and justified.
While Theorem 4.1 is a good start, ideally we would also be able to provide a bound on W_2(ν*_n, ν*). This would follow readily from sharpness of the functional F[ν], or even the discrete version F(w), but it is not immediately clear how to achieve such a result.

5 Experiments

We demonstrate the applicability of our method on two experiments, one synthetic and one performing a real inference task. Together, these showcase the positive traits of our algorithm: speed, parallelization, robustness to non-stationarity, applicability to non-Euclidean domains, and immediate performance benefit to Bayesian inference. We implemented our algorithm in C++ using MPI, and our code is posted at github.com/mstaib/stochastic-barycenter-code. 
Full experiment details are given in Appendix D.

Figure 1: The Wasserstein barycenter of four von Mises-Fisher distributions on the unit sphere S2. From left to right, the figures show the initial distributions merging into the Wasserstein barycenter. As the input distributions are moved along parallel paths on the sphere, the barycenter accurately tracks the new locations as shown in the final three figures.

5.1 Von Mises-Fisher Distributions with Drift

We demonstrate computation and tracking of the barycenter of four drifting von Mises-Fisher distributions on the unit sphere S2. Note that W_2 and the barycentric cost are now defined with respect to geodesic distance on S2.
The distributions are randomly centered, and we move the center of each distribution 3 × 10^{−5} radians (in the same direction for all distributions) each time a sample is drawn. A snapshot of the results is shown in Figure 1. Our algorithm is clearly able to track the barycenter as the distributions move.

5.2 Large Scale Bayesian Inference

We run logistic regression on the UCI skin segmentation dataset [8]. The 245057 datapoints are colors represented in R^3, each with a binary label determining whether that color is a skin color. We split consecutive blocks of the dataset into 127 subsets, and due to locality in the dataset, the data in each subset is not identically distributed. Each subset is assigned one thread of an InfiniBand cluster on which we simultaneously sample from the subset posterior via MCMC and optimize the barycenter estimate. 
This is in contrast to [47], where the barycenter can be computed via a linear program (LP) only after all samplers are run. Since the full dataset is tractable, we can compare the two methods via W2 distance to the posterior of the full dataset, which we can estimate via the large-scale optimal transport algorithm in [22] or by LP, depending on the support size. For each method, we fix n barycenter support points on a mesh determined by samples from the subset posteriors.
After 317 seconds, or about 10000 iterations per subset posterior, our algorithm has produced a barycenter on n ≈ 10^4 support points with W2 distance about 26 from the full posterior. Similarly competitive results hold even for n ≈ 10^5 or 10^6, though tuning the stepsize becomes more challenging. Even in the 10^6 case, no individual 16-thread node used more than 2GB of memory. For n ≈ 10^4, over a wide range of stepsizes we can in seconds approximate the full posterior better than is possible with the LP, as seen in Figure 2, by terminating early.
In comparison, in Table 1 we attempt to compute the barycenter LP as in [47] via Mosek [4], for varying values of n. Even n = 480 is not possible on a system with 16GB of memory, and feasible values of n result in meshes too sparse to accurately and reliably approximate the barycenter. Specifically, there are several cases where n increases but the approximation quality actually decreases: the subset posteriors are spread far apart, and the barycenter is so small relative to the required bounding box that likely only one grid point is close to it, and how close this grid point is depends on the specific mesh. To avoid this behavior, one must either use a dense grid (our approach), or invent a better method for choosing support points that will still cover the barycenter.
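For small support sizes, the W2 evaluation metric used above can be computed exactly as a transportation linear program. A minimal sketch, using SciPy's `linprog` rather than Mosek (the helper `w2_lp` and the toy measures are our own, not from the paper's code):

```python
import numpy as np
from scipy.optimize import linprog

def w2_lp(x, a, y, b):
    """Squared 2-Wasserstein distance between discrete measures (x, a)
    and (y, b), by solving the transportation LP over couplings P >= 0
    with row marginals a and column marginals b."""
    C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** 2
    m, n = C.shape
    rows = []
    for i in range(m):            # row-marginal constraints: P @ 1 = a
        r = np.zeros((m, n)); r[i, :] = 1.0; rows.append(r.ravel())
    for j in range(n):            # column-marginal constraints: P^T @ 1 = b
        c = np.zeros((m, n)); c[:, j] = 1.0; rows.append(c.ravel())
    res = linprog(C.ravel(), A_eq=np.array(rows),
                  b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun                # W2^2

x = np.array([[0.0], [1.0]]); a = np.array([0.5, 0.5])
y = np.array([[2.0], [3.0]]); b = np.array([0.5, 0.5])
cost = w2_lp(x, a, y, b)          # monotone coupling 0->2, 1->3 gives W2^2 = 4
```

The LP has m·n variables, which is exactly why this evaluation route (like the barycenter LP of [47]) stops scaling once the meshes become fine.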
In terms of compute time, entropy regularized methods may have fared better than the LP for finer meshes, but would still not give the same result as our method. Note also that the LP timings include only optimization time, whereas in 317 seconds our algorithm produces samples and optimizes.

Figure 2: Convergence of our algorithm with n ≈ 10^4 for different stepsizes. In each case we recover a better approximation than what was possible with the LP for any n, in as little as ≈ 30 seconds.

Table 1: Number of support points n versus computation time and W2 distance to the true posterior. Compared to prior work, our algorithm handles much finer meshes, producing much better estimates.

             |                 Linear program from [47]                       | This paper
  n          |  24    40    60    84    189   320   396   480                 | 10^4
  time (s)   |  0.5   0.97  2.9   6.1   34    163   176   out of memory       | 317
  W2         |  41.1  59.3  50.0  34.3  44.3  53.7  45    out of memory       | 26.3

6 Conclusion and Future Directions

We have proposed an original algorithm for computing the Wasserstein barycenter of arbitrary measures given a stream of samples. Our algorithm is communication-efficient, highly parallel, easy to implement, and enjoys consistency results that, to the best of our knowledge, are new. Our method has immediate impact on large-scale Bayesian inference and sensor fusion tasks: for Bayesian inference in particular, we obtain far finer estimates of the Wasserstein-averaged subset posterior (WASP) [47] than was possible before, enabling faster and more accurate inference.
There are many directions for future work: we have barely scratched the surface in terms of new applications of large-scale Wasserstein barycenters, and there are still many possible algorithmic improvements. One implication of Theorem 3.1 is that a faster algorithm for solving the concave problem (11) immediately yields faster convergence to the barycenter.
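Whatever first-order solver is used for the weight subproblem, the weights w live on the probability simplex, so each iterate must be projected back onto it. The classical O(n log n) sort-and-threshold projection of Michelot [32] and Duchi et al. [19], both cited below, is only a few lines; this sketch uses our own function name.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {w : w >= 0, sum(w) = 1}, via the sort-and-threshold method
    of Michelot [32] / Duchi et al. [19].  O(n log n)."""
    u = np.sort(v)[::-1]                      # sort entries descending
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # largest k with u_k + (1 - sum_{i<=k} u_i)/k > 0
    rho = np.nonzero(u + (1.0 - css) / ks > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)      # shared shift
    return np.maximum(v + theta, 0.0)

w = project_simplex(np.array([0.9, 0.8, -0.5]))   # -> [0.55, 0.45, 0.0]
```

Points already on the simplex are fixed points of the projection, so the step adds no bias near the solution.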
Incorporating variance reduction [18, 27] is a promising direction, provided we maintain communication-efficiency. Recasting problem (11) as distributed consensus optimization [35, 11] would further help scale up the barycenter computation to huge numbers of input measures.

Acknowledgements We thank the anonymous reviewers for their helpful suggestions. We also thank MIT Supercloud and the Lincoln Laboratory Supercomputing Center for providing computational resources. M. Staib acknowledges Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a. J. Solomon acknowledges funding from the MIT Research Support Committee ("Structured Optimization for Geometric Problems"), as well as Army Research Office grant W911NF-12-R-0011 ("Smooth Modeling of Flows on Graphs"). This research was supported by NSF CAREER award 1553284 and The Defense Advanced Research Projects Agency (grant number N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the author and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

References
[1] M. Agueh and G. Carlier. Barycenters in the Wasserstein Space. SIAM J. Math. Anal., 43(2):904–924, January 2011. ISSN 0036-1410. doi: 10.1137/100805741.
[2] Ethan Anderes, Steffen Borgwardt, and Jacob Miller. Discrete Wasserstein barycenters: Optimal transport for discrete data. Math Meth Oper Res, 84(2):389–409, October 2016. ISSN 1432-2994, 1432-5217. doi: 10.1007/s00186-016-0549-x.
[3] Elaine Angelino, Matthew James Johnson, and Ryan P. Adams. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(2-3):119–247, 2016. ISSN 1935-8237.
doi: 10.1561/2200000052. URL http://dx.doi.org/10.1561/2200000052.
[4] MOSEK ApS. The MOSEK optimization toolbox for MATLAB manual. Version 8.0.0.53., 2017. URL http://docs.mosek.com/8.0/toolbox/index.html.
[5] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. 2017.
[6] M. Baum, P. K. Willett, and U. D. Hanebeck. On Wasserstein Barycenters and MMOSPA Estimation. IEEE Signal Process. Lett., 22(10):1511–1515, October 2015. ISSN 1070-9908. doi: 10.1109/LSP.2015.2410217.
[7] J. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman Projections for Regularized Transportation Problems. SIAM J. Sci. Comput., 37(2):A1111–A1138, January 2015. ISSN 1064-8275. doi: 10.1137/141000439.
[8] Rajen Bhatt and Abhinav Dhall. Skin segmentation dataset. UCI Machine Learning Repository.
[9] Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Regularization of barycenters in the Wasserstein space. arXiv:1606.01025 [math, stat], June 2016.
[10] Emmanuel Boissard, Thibaut Le Gouic, and Jean-Michel Loubes. Distribution's template estimate with Wasserstein metrics. Bernoulli, 21(2):740–759, May 2015. ISSN 1350-7265. doi: 10.3150/13-BEJ585.
[11] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[12] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming Variational Bayes. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1727–1735. Curran Associates, Inc., 2013.
[13] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic Gradient Hamiltonian Monte Carlo.
In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1683–1691, Beijing, China, 22–24 Jun 2014. PMLR. URL http://proceedings.mlr.press/v32/cheni14.html.
[14] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal Transport for Domain Adaptation. IEEE Trans. Pattern Anal. Mach. Intell., PP(99):1–1, 2016. ISSN 0162-8828. doi: 10.1109/TPAMI.2016.2615921.
[15] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2292–2300. Curran Associates, Inc., 2013.
[16] Marco Cuturi and Arnaud Doucet. Fast Computation of Wasserstein Barycenters. In Proceedings of the 31st International Conference on Machine Learning, pages 685–693, 2014.
[17] Marco Cuturi and Gabriel Peyré. A Smoothed Dual Approach for Variational Wasserstein Problems. SIAM J. Imaging Sci., 9(1):320–343, January 2016. doi: 10.1137/15M1032600.
[18] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[19] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.
[20] Yoav Freund and Robert E. Schapire. Adaptive Game Playing Using Multiplicative Weights. Games and Economic Behavior, 29(1):79–103, October 1999. ISSN 0899-8256. doi: 10.1006/game.1999.0738.
[21] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a Wasserstein Loss. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R.
Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2053–2061. Curran Associates, Inc., 2015.
[22] Aude Genevay, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic Optimization for Large-scale Optimal Transport. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3440–3448. Curran Associates, Inc., 2016.
[23] Elad Hazan. Introduction to Online Convex Optimization. OPT, 2(3-4):157–325, August 2016. ISSN 2167-3888, 2167-3918. doi: 10.1561/2400000013.
[24] Michael Held, Philip Wolfe, and Harlan P. Crowder. Validation of subgradient optimization. Mathematical Programming, 6(1):62–88, December 1974. ISSN 0025-5610, 1436-4646. doi: 10.1007/BF01580223.
[25] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
[26] Matthew Johnson, James Saunderson, and Alan Willsky. Analyzing hogwild parallel Gaussian Gibbs sampling. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2715–2723. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/5043-analyzing-hogwild-parallel-gaussian-gibbs-sampling.pdf.
[27] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[28] Young-Heon Kim and Brendan Pass. Wasserstein barycenters over Riemannian manifolds. Advances in Mathematics, 307:640–683, February 2017. ISSN 0001-8708. doi: 10.1016/j.aim.2016.11.026.
[29] Jun Kitagawa, Quentin Mérigot, and Boris Thibert. Convergence of a Newton algorithm for semi-discrete optimal transport.
arXiv:1603.05579 [cs, math], March 2016.
[30] Benoît Kloeckner. Approximation by finitely supported measures. ESAIM Control Optim. Calc. Var., 18(2):343–359, 2012. ISSN 1292-8119.
[31] Bruno Lévy. A Numerical Algorithm for L2 Semi-Discrete Optimal Transport in 3D. ESAIM Math. Model. Numer. Anal., 49(6):1693–1715, November 2015. ISSN 0764-583X, 1290-3841. doi: 10.1051/m2an/2015055.
[32] C. Michelot. A finite algorithm for finding the projection of a point onto the canonical simplex of R^n. J Optim Theory Appl, 50(1):195–200, July 1986. ISSN 0022-3239, 1573-2878. doi: 10.1007/BF00938486.
[33] Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Scalable and Robust Bayesian Inference via the Median Posterior. In PMLR, pages 1656–1664, January 2014.
[34] Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein Training of Restricted Boltzmann Machines. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3718–3726. Curran Associates, Inc., 2016.
[35] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
[36] Willie Neiswanger, Chong Wang, and Eric P. Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, pages 623–632, Arlington, Virginia, United States, 2014. AUAI Press. ISBN 978-0-9749039-1-0. URL http://dl.acm.org/citation.cfm?id=3020751.3020816.
[37] Arkadi Nemirovski and Reuven Y. Rubinstein.
An Efficient Stochastic Approximation Algorithm for Stochastic Saddle Point Problems. In Moshe Dror, Pierre L'Ecuyer, and Ferenc Szidarovszky, editors, Modeling Uncertainty, number 46 in International Series in Operations Research & Management Science, pages 156–184. Springer US, 2005. ISBN 978-0-7923-7463-3 978-0-306-48102-4. doi: 10.1007/0-306-48102-2_8.
[38] David Newman, Padhraic Smyth, Max Welling, and Arthur U. Asuncion. Distributed inference for latent Dirichlet allocation. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1081–1088. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3330-distributed-inference-for-latent-dirichlet-allocation.pdf.
[39] Gabriel Peyré, Marco Cuturi, and Justin Solomon. Gromov-Wasserstein Averaging of Kernel and Distance Matrices. In PMLR, pages 2664–2672, June 2016.
[40] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein Barycenter and Its Application to Texture Mixing. In Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, Berlin, Heidelberg, May 2011. doi: 10.1007/978-3-642-24785-9_37.
[41] Antoine Rolet, Marco Cuturi, and Gabriel Peyré. Fast Dictionary Learning with a Smoothed Wasserstein Loss. In PMLR, pages 630–638, May 2016.
[42] Filippo Santambrogio. Optimal Transport for Applied Mathematicians, volume 87 of Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, Cham, 2015. ISBN 978-3-319-20827-5 978-3-319-20828-2. doi: 10.1007/978-3-319-20828-2.
[43] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[44] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett, editors, Advances in Neural Information Processing Systems 29, pages 46–54. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6245-without-replacement-sampling-for-stochastic-gradient-methods.pdf.
[45] Justin Solomon, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains. ACM Trans Graph, 34(4):66:1–66:11, July 2015. ISSN 0730-0301. doi: 10.1145/2766963.
[46] Suvrit Sra. Directional Statistics in Machine Learning: A Brief Review. arXiv:1605.00316 [stat], May 2016.
[47] Sanvesh Srivastava, Volkan Cevher, Quoc Dinh, and David Dunson. WASP: Scalable Bayes via barycenters of subset posteriors. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 912–920, San Diego, California, USA, 09–12 May 2015. PMLR. URL http://proceedings.mlr.press/v38/srivastava15.html.
[48] Sanvesh Srivastava, Cheng Li, and David B. Dunson. Scalable Bayes via Barycenter in Wasserstein Space. arXiv:1508.05880 [stat], August 2015.
[49] Cédric Villani. Optimal Transport: Old and New. Number 338 in Grundlehren der mathematischen Wissenschaften. Springer, Berlin, 2009. ISBN 978-3-540-71049-3. OCLC: ocn244421231.
[50] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
[51] J. Ye, P. Wu, J. Z. Wang, and J. Li. Fast Discrete Distribution Clustering Using Wasserstein Barycenter With Sparse Support. IEEE Trans. Signal Process., 65(9):2317–2332, May 2017. ISSN 1053-587X.
doi: 10.1109/TSP.2017.2659647.
[52] Yuchen Zhang, John C Duchi, and Martin J Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
[53] Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Journal of Machine Learning Research, 16:3299–3340, 2015. URL http://jmlr.org/papers/v16/zhang15d.html.
[54] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.