{"title": "Large-scale optimal transport map estimation using projection pursuit", "book": "Advances in Neural Information Processing Systems", "page_first": 8118, "page_last": 8129, "abstract": "This paper studies the estimation of large-scale optimal transport maps (OTM), which is a well known challenging problem owing to the curse of dimensionality.\nExisting literature approximates the large-scale OTM by a series of one-dimensional OTM problems through iterative random projection.\nSuch methods, however, suffer from slow or none convergence in practice due to the nature of randomly selected projection directions. \nInstead, we propose an estimation method of large-scale OTM by combining the idea of projection pursuit regression and sufficient dimension reduction. \nThe proposed method, named projection pursuit Monge map (PPMM), adaptively selects the most ``informative'' projection direction in each iteration. \nWe theoretically show the proposed dimension reduction method can consistently estimate the most ``informative'' projection direction in each iteration. \nFurthermore, the PPMM algorithm weakly convergences to the target large-scale OTM in a reasonable number of steps. \nEmpirically, PPMM is computationally easy and converges fast. 
\nWe assess its finite sample performance through the applications of Wasserstein distance estimation and generative models.", "full_text": "Large-scale optimal transport map estimation using\n\nprojection pursuit\n\nCheng Meng1 Yuan Ke1 Jingyi Zhang1 Mengrui Zhang1 Wenxuan Zhong1 Ping Ma1\n\n{cheng.meng25, yuan.ke, jingyi.zhang25, mengrui.zhang, wenxuan, pingma }@uga.edu\n\n1Department of Statistics, University of Georgia\n\nAbstract\n\nThis paper studies the estimation of large-scale optimal transport maps (OTM),\nwhich is a well known challenging problem owing to the curse of dimensionality.\nExisting literature approximates the large-scale OTM by a series of one-dimensional\nOTM problems through iterative random projection. Such methods, however, suffer\nfrom slow or none convergence in practice due to the nature of randomly selected\nprojection directions. Instead, we propose an estimation method of large-scale OTM\nby combining the idea of projection pursuit regression and suf\ufb01cient dimension\nreduction. The proposed method, named projection pursuit Monge map (PPMM),\nadaptively selects the most \u201cinformative\u201d projection direction in each iteration.\nWe theoretically show the proposed dimension reduction method can consistently\nestimate the most \u201cinformative\u201d projection direction in each iteration. Furthermore,\nthe PPMM algorithm weakly convergences to the target large-scale OTM in a\nreasonable number of steps. Empirically, PPMM is computationally easy and\nconverges fast. We assess its \ufb01nite sample performance through the applications of\nWasserstein distance estimation and generative models.\n\n1\n\nIntroduction\n\nRecently, optimal transport map (OTM) draws great attention in machine learning, statistics, and\ncomputer science due to its close relationship to generative models, including generative adversarial\nnets [19], the \u201cdecoder\u201d network in variational autoencoders [27], among others. 
In a generative model, the goal is usually to generate a "fake" sample that is indistinguishable from the genuine one. This is equivalent to finding a transport map φ from random noise with distribution p_X (e.g., a Gaussian or uniform distribution) to the underlying population distribution p_Y of the genuine sample, e.g., the MNIST or ImageNet dataset. Nowadays, generative models are widely used for generating realistic images [12, 33], songs [4, 13], and videos [32, 53]. Besides generative models, OTM also plays an essential role in various machine learning applications, such as color transfer [14, 41], shape matching [50], transfer learning [10, 38], and natural language processing [38].

Despite its impressive performance, the computation of the OTM is challenging for a large-scale sample with massive sample size and/or high dimensionality. Traditional methods for estimating the OTM include finding a parametric map and using ordinary differential equations [8, 2]. To address the computational concern, recent developments in OTM estimation have been based on solving linear programs [44, 37]. Let {x_i}_{i=1}^n ⊂ R^d and {y_i}_{i=1}^n ⊂ R^d be two samples from continuous probability distributions p_X and p_Y, respectively. Estimating the OTM from p_X to p_Y by solving a linear program requires O(n^3 log(n)) computational time for fixed d [38, 47]. To alleviate the computational burden, some literature [11, 17, 1, 21] pursued fast computation approaches for the OTM objective, i.e., the Wasserstein distance. Another school of methods aims to estimate the OTM efficiently when d is small, including multi-scale approaches [35, 18] and dynamic formulations [48, 36].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

These methods utilize space discretization and thus are generally not applicable in high-dimensional cases.

The random projection method (also known as the Radon transform method) was proposed to estimate OTMs efficiently when d is large [39, 40]. Such a method tackles the problem of estimating a d-dimensional OTM iteratively by breaking the problem down into a series of subproblems, each of which finds a one-dimensional OTM using projected samples. Denote S^{d−1} as the d-dimensional unit sphere. In each iteration, a random direction θ ∈ S^{d−1} is picked, and the one-dimensional OTM is then calculated between the projected samples {x_i^⊤θ}_{i=1}^n and {y_i^⊤θ}_{i=1}^n. The collection of all the one-dimensional maps serves as the final estimate of the target OTM. The sliced method modifies the random projection method by considering a large set of random directions from S^{d−1} in each iteration [7, 42]. The "mean map" of the one-dimensional OTMs over these random directions is considered as a component of the final estimate of the target OTM. We call the random projection method, the sliced method, and their variants the projection-based approach. Such an approach reduces the computational cost of calculating an OTM from O(n^3 log(n)) to O(Kn log(n)), where K is the number of iterations until convergence. However, there is no theoretical guideline on the order of K. In addition, the existing projection-based approaches usually require a large number of iterations to converge, or even fail to converge. We speculate that the slow convergence occurs because a randomly selected projection direction may not be "informative", leading to a one-dimensional OTM that fails to be a decent representation of the target OTM. We illustrate such a phenomenon through an illustrative example as follows.

An illustrative example.
The left and right panels in Figure 1 illustrate the importance of choosing the "informative" projection direction in OTM estimation. The goal is to obtain the OTM φ* which maps a source distribution p_X (colored in red) to a target distribution p_Y (colored in green). For each panel, we first randomly pick a projection direction (black arrow) and obtain the marginal distributions of p_X and p_Y (the bell-shaped curves), respectively. The one-dimensional OTM can then be calculated based on the marginal distributions. Applying such a map to the source distribution yields the transformed distribution (colored in blue). One can observe that the transformed distributions are significantly different from the target ones. Such an observation indicates that the one-dimensional OTM with respect to a random projection direction may fail to represent the target OTM well. This observation motivates us to select the "informative" projection direction (red arrow), which yields a better one-dimensional OTM.

Our contributions. To address the issues mentioned above, this paper introduces a novel statistical approach to estimate large-scale OTMs. The proposed method, named projection pursuit Monge map (PPMM), improves the existing projection-based approaches in two aspects. First, PPMM uses a sufficient dimension reduction technique to estimate the most "informative" projection direction in each iteration. Second, PPMM is based on projection pursuit [16]. The idea is similar to boosting, which searches for the next optimal direction based on the residual of the previous ones. Theoretically, we show the proposed method can consistently estimate the most "informative" projection direction in each iteration, and the algorithm converges weakly to the target large-scale OTM in a reasonable number of steps.
The finite sample performance of the proposed algorithm is evaluated by two applications: Wasserstein distance estimation and generative models. We show the proposed method outperforms several state-of-the-art large-scale OTM estimation methods through extensive experiments on various synthetic and real-world datasets.

Figure 1: Illustration of the "informative" projection direction.

2 Problem setup and methodology

Optimal transport map and Wasserstein distance. Denote X ∈ R^d and Y ∈ R^d as two continuous random variables with probability distribution functions p_X and p_Y, respectively. The problem of finding a transport map φ: R^d → R^d such that φ(X) and Y have the same distribution has been widely studied in mathematics, probability, and economics; see [14, 50, 43] for examples of some new developments. Note that the transport map between the two distributions is not unique. Among all transport maps, it may be of interest to define the "optimal" one according to some criterion. A standard approach, named the Monge formulation [52], is to find the OTM^1 φ* that satisfies

  φ* = arg inf_{φ∈Φ} ∫_{R^d} ‖X − φ(X)‖^p dp_X,

where Φ is the set of all transport maps, ‖·‖ is the vector norm, and p is a positive integer. Given the existence of the Monge map, the Wasserstein distance of order p is defined as

  W_p(p_X, p_Y) = ( ∫_{R^d} ‖X − φ*(X)‖^p dp_X )^{1/p}.

Denote φ̂ as an estimator of φ*. Suppose one observes X = (x_1, ..., x_n)^⊤ ∈ R^{n×d} and Y = (y_1, ..., y_n)^⊤ ∈ R^{n×d} from p_X and p_Y, respectively. The Wasserstein distance W_p(p_X, p_Y) can thus be estimated by

  Ŵ_p(X, Y) = ( (1/n) Σ_{i=1}^n ‖x_i − φ̂(x_i)‖^p )^{1/p}.

Projection pursuit method. Projection pursuit regression [16, 24, 15, 26] is widely used for high-dimensional nonparametric regression models, which take the form

  z_i = Σ_{j=1}^s f_j(β_j^⊤ x_i) + ε_i,  i = 1, ..., n,  (1)

where s is a hyper-parameter, {z_i}_{i=1}^n ⊂ R is the univariate response, {x_i}_{i=1}^n ⊂ R^d are covariates, and {ε_i}_{i=1}^n are i.i.d. normal errors. The goal is to estimate the unknown link functions {f_j}_{j=1}^s: R → R and the unknown coefficients {β_j}_{j=1}^s ⊂ R^d.

The additive model (1) can be fitted in an iterative fashion. In the kth iteration, k = 2, ..., s, denote {(f̂_j, β̂_j)}_{j=1}^{k−1} the estimates of {(f_j, β_j)}_{j=1}^{k−1} obtained from the previous k − 1 iterations. Denote R_i^[k] = z_i − Σ_{j=1}^{k−1} f̂_j(β̂_j^⊤ x_i), i = 1, ..., n, the residuals. Then (f_k, β_k) can be estimated by solving the following least squares problem

  min_{f_k, β_k} Σ_{i=1}^n [ R_i^[k] − f_k(β_k^⊤ x_i) ]^2.

The above iterative process explains the intuition behind projection pursuit regression. Given the model fitted in previous iterations, we fit a one-dimensional regression model using the current residuals, rather than the original responses. We then add this new regression model to the fitted function in order to update the residuals.
By adding small regression models to the residuals, we gradually improve the fitted model in areas where it does not perform well.

The intuition of projection pursuit regression motivates us to modify the existing projection-based OTM estimation approaches in two aspects. First, in the kth iteration, we propose to seek a new projection direction for the one-dimensional OTM in the subspace spanned by the residuals of the previous k − 1 directions. On the contrary, following a direction that is in the span of used ones can lead to an inefficient one-dimensional OTM. As a result, such a "move" may hardly reduce the Wasserstein distance between p_X and p_Y. Such inefficient "moves" can be one of the causes of the convergence issue in existing projection-based OTM estimation algorithms. Second, in each iteration, we propose to select the most "informative" direction with respect to the current residuals rather than a random one. Specifically, we choose the direction that explains the highest proportion of variation in the subspace spanned by the current residuals. Intuitively, this direction addresses the maximum marginal "discrepancy" between p_X and p_Y among the ones that are not considered by previous iterations. We propose to estimate this most "informative" direction with sufficient dimension reduction techniques, introduced as follows.

^1 Such a map is thus also called the Monge map.

Sufficient dimension reduction. Consider a regression problem with a univariate response Z and a d-dimensional predictor X. Sufficient dimension reduction for regression aims to reduce the dimension of X while preserving its regression relation with Z. In other words, sufficient dimension reduction seeks a set of linear combinations of X, say B^⊤X with some B ∈ R^{d×q} and q ≤ d, such that Z depends on X only through B^⊤X, i.e., Z ⊥⊥ X | B^⊤X. The column space of B, denoted as S(B), is called a dimension reduction space (DRS). Furthermore, if the union of all possible DRSs is also a DRS, we call it the central subspace and denote it as S_{Z|X}. When S_{Z|X} exists, it is the minimum DRS. We call a sufficient dimension reduction method exclusive if it induces a DRS that equals the central subspace. Some popular sufficient dimension reduction techniques include sliced inverse regression (SIR) [30], principal Hessian directions (PHD) [31], the sliced average variance estimator (SAVE) [9], and directional regression (DR) [29], among others.

Estimation of the most "informative" projection direction. Consider estimating an OTM between a source sample and a target sample. We first form a regression problem by adding a binary response, which equals zero for the source sample and one for the target sample. We then utilize a sufficient dimension reduction technique to select the most "informative" projection direction. To be specific, we select the projection direction ξ ∈ R^d as the eigenvector corresponding to the largest eigenvalue of the estimated B. The direction ξ is most "informative" in the sense that the projected samples Xξ and Yξ have the most substantial "discrepancy". The metric of the "discrepancy" depends on the choice of the sufficient dimension reduction technique. Figure 2 gives a toy example to illustrate this idea. In this paper, we opt to use SAVE for calculating B, and hence the "discrepancy" metric is the difference between Var(Xξ) and Var(Yξ).

Figure 2: The most "informative" projection direction ensures the projected samples (illustrated by the distributions colored in red and blue, respectively) have the largest "discrepancy".
Empirically, we find that other sufficient dimension reduction techniques, like PHD and DR, yield similar performance. The SIR method, however, yields inferior performance, since it only considers the first moment. Algorithm 1 below introduces our estimation method for the "informative" projection direction in detail.

Algorithm 1 Select the most "informative" projection direction using SAVE
  Input: two standardized matrices X ∈ R^{n×d} and Y ∈ R^{n×d}
  Step 1: calculate Σ̂ ∈ R^{d×d}, i.e., the sample variance-covariance matrix of the pooled sample (X^⊤, Y^⊤)^⊤
  Step 2: calculate the sample variance-covariance matrices of XΣ̂^{−1/2} and YΣ̂^{−1/2}, denoted as Σ̂_1 ∈ R^{d×d} and Σ̂_2 ∈ R^{d×d}, respectively
  Step 3: calculate the eigenvector ξ ∈ R^d corresponding to the largest eigenvalue of the matrix ((Σ̂_1 − I_d)^2 + (Σ̂_2 − I_d)^2)/4
  Output: the final result is given by Σ̂^{−1/2}ξ/‖Σ̂^{−1/2}ξ‖, where ‖·‖ denotes the Euclidean norm

Projection pursuit Monge map algorithm. Now we are ready to present our estimation method for large-scale OTM. The detailed algorithm, named projection pursuit Monge map, is summarized in Algorithm 2 below. In each iteration, PPMM applies a one-dimensional OTM along the most "informative" projection direction selected by Algorithm 1.

Computational cost of PPMM. In Algorithm 2, the computational cost mainly resides in the first two steps within each iteration. In step (a), one calculates ξ_k using Algorithm 1, whose computational cost is of order O(nd^2).
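Putting the direction selection of Algorithm 1 together with the sort-based one-dimensional map and the projection pursuit update, one iteration of the method can be sketched as follows. This is a minimal NumPy sketch under equal sample sizes and a fixed iteration count; the function names are ours, not the paper's.

```python
import numpy as np

def save_direction(X, Y):
    """Algorithm 1: most 'informative' direction via SAVE."""
    Z = np.vstack([X, Y])
    Sigma = np.cov(Z, rowvar=False)            # pooled covariance
    w, V = np.linalg.eigh(Sigma)               # Sigma^{-1/2} via eigendecomposition
    Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T
    S1 = np.cov(X @ Sigma_inv_half, rowvar=False)
    S2 = np.cov(Y @ Sigma_inv_half, rowvar=False)
    I = np.eye(X.shape[1])
    M = ((S1 - I) @ (S1 - I) + (S2 - I) @ (S2 - I)) / 4.0
    _, vecs = np.linalg.eigh(M)
    xi = Sigma_inv_half @ vecs[:, -1]          # top eigenvector, mapped back
    return xi / np.linalg.norm(xi)

def one_dim_otm(u, v):
    """1-D OTM between equal-size samples: match i-th smallest to i-th smallest."""
    mapped = np.empty_like(u)
    mapped[np.argsort(u)] = np.sort(v)
    return mapped

def ppmm(X, Y, n_iter):
    """Iteratively push X toward Y along SAVE directions (steps (a)-(c))."""
    Xk = X.copy()
    for _ in range(n_iter):
        xi = save_direction(Xk, Y)             # step (a)
        proj = Xk @ xi
        mapped = one_dim_otm(proj, Y @ xi)     # step (b)
        Xk = Xk + np.outer(mapped - proj, xi)  # step (c)
    return Xk
```

After one iteration, the marginal of the updated source sample along the chosen direction matches the target marginal exactly, which is what makes each "move" count.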
In step (b), one calculates a one-dimensional OTM using the look-up table, which is simply a sorting algorithm [40, 38]. The computational cost for step (b) is of order O(n log(n)).

Algorithm 2 Projection pursuit Monge map (PPMM)
  Input: two matrices X ∈ R^{n×d} and Y ∈ R^{n×d}
  k ← 0, X^[0] ← X
  repeat
    (a) calculate the projection direction ξ_k ∈ R^d between X^[k] and Y (using Algorithm 1)
    (b) find the one-dimensional OTM φ^(k) that matches X^[k]ξ_k to Yξ_k (using the look-up table)
    (c) X^[k+1] ← X^[k] + (φ^(k)(X^[k]ξ_k) − X^[k]ξ_k)ξ_k^⊤ and k ← k + 1
  until convergence
  The final estimator is given by φ̂: X → X^[k]

Suppose that the algorithm converges after K iterations. The overall computational cost of Algorithm 2 is of order O(Knd^2 + Kn log(n)). Empirically, we find K = O(d) works reasonably well. When log(n)^{1/2} ≤ d ≪ n^{2/3}, the order of the computational cost of PPMM is o(n^3 log(n)), which is smaller than the computational cost of the naive method for calculating OTMs. When d ≤ log(n)^{1/2}, the order of the computational cost reduces to O(Kn log(n)), which is faster than the existing projection-based methods given that PPMM converges faster. The memory cost of Algorithm 2 mainly resides in step (a), which is of the order O(Knd^2).

3 Theoretical results

Exclusiveness of SAVE. For mathematical simplicity, we assume E[X] = E[Y] = 0_d. When E[X] ≠ E[Y], one can use a first-order dimension reduction method like SIR to adjust the means before applying SAVE.

Denote W = (X + Y)/2, Σ_W = Var(W), and Z = WΣ_W^{−1/2}. For a univariate continuous response variable R, one can approximate the central subspace S_{R|Z} by S_SAVE, which is the population version of the dimension reduction space of SAVE.
To be specific, S_SAVE is the column space of the matrix

  E[Var(Z|R) − I_d]^2 = (1/4){ E[Var(XΣ_W^{−1/2}|R) − I_d]^2 + E[Var(YΣ_W^{−1/2}|R) − I_d]^2 },

where the above equation uses the fact that X ⊥⊥ Y.

Assumption 1. Let P be the projection onto the central subspace S_{R|Z} with respect to the inner product a · b = a^⊤b. For any nonzero vectors u, v ∈ R^d such that u is orthogonal to S_{R|Z} and v ∈ S_{R|Z}, we assume

  (a) E(u^⊤Z | PZ) is a linear function of Z;
  (b) Var(u^⊤Z | PZ) is a nonrandom number;
  (c) Let (Z̃, R̃) be an independent copy of (Z, R). E[ v^⊤(Z − Z̃)^2 | R, R̃ ] is non-degenerate; that is, it is not almost surely equal to a constant.

Theorem 1. Let R be a univariate continuous response variable. Under Assumption 1, the dimension reduction space induced by SAVE is exclusive. In other words, S_SAVE = S_{R|Z}.

Consistency of the most "informative" projection direction. Let Σ̂_1 and Σ̂_2 be the sample covariance matrix estimators of Σ_1 and Σ_2, respectively. Denote

  Σ_SAVE = (1/4)[(Σ_1 − I_d)^2 + (Σ_2 − I_d)^2]  and  Σ̂_SAVE = (1/4)[(Σ̂_1 − I_d)^2 + (Σ̂_2 − I_d)^2].

Denote ξ_1 and ξ̂_1 the eigenvectors corresponding to the largest eigenvalues of Σ_SAVE and Σ̂_SAVE, respectively. Further, denote r = Rank(Σ_SAVE), the rank of Σ_SAVE.

Assumption 2. Let {x_i, y_i}_{i=1}^n be an i.i.d. sample of (X, Y). We assume that

  (a) Denote x_ij and y_ik the jth and kth components of x_i and y_i, respectively. E(x_ij y_ik) = 0 for all 1 ≤ i ≤ n and 1 ≤ j, k ≤ d;
  (b) There are r_1, r_2 > 0 and b_1, b_2 > 0 such that, for any s > 0, 1 ≤ i ≤ n and 1 ≤ j ≤ d, P(|x_ij| > s) ≤ exp{−(s/b_1)^{r_1}} and P(|y_ij| > s) ≤ exp{−(s/b_2)^{r_2}};
  (c) Let λ_1, ..., λ_d be the eigenvalues of Σ_SAVE in descending order. There exist positive constants c_l, c_u, and c_3 such that c_l ≤ min_{1≤l≤r−1} (λ_l − λ_{l+1})d^{−1/2} ≤ c_u and 0 ≤ λ_{r+1} < c_3.

Theorem 2 shows that Algorithm 1 can consistently estimate the most "informative" projection direction. The O_p in Theorem 2 stands for order in probability, which is similar to O but for random variables.

Theorem 2. Under Assumption 2, the SAVE estimator of the most "informative" projection direction satisfies

  ‖ξ̂_1 − ξ_1‖_∞ = O_p( r^4 √(log d / n) + r^4 √d (log d)/n ),

as n, d → ∞.

Weak convergence of the PPMM algorithm. Denote φ* as the d-dimensional optimal transport map from p_X to p_Y and φ^(K) as the PPMM estimator after K iterations, i.e., φ^(K)(X) = X^[K]. The following theorem gives the weak convergence result for the PPMM algorithm.

Theorem 3. Suppose Assumptions 1 and 2 hold. Let K ≥ Cd for some large enough positive constant C. One has

  Ŵ_p( φ^(K)(X), X ) → W_p( φ*(X), X ),

and φ^(K)(X) → φ*(X) as n → ∞.

There are works proving convergence rates of empirical optimal transport objectives [5, 49, 6, 54]. The convergence rate of the OTM itself has rarely been studied, except for a recent paper [25].
We believe Theorem 3 is the first step in this direction.

4 Numerical experiments

4.1 Estimation of optimal transport map

Suppose that we observe i.i.d. samples X = (x_1, ..., x_n)^⊤ from p_X = N_d(μ_X, Σ_X) and Y = (y_1, ..., y_n)^⊤ from p_Y = N_d(μ_Y, Σ_Y), respectively. We set n = 10,000, d ∈ {10, 20, 50}, μ_X = −2, μ_Y = 2, (Σ_X)_{ij} = 0.8^{|i−j|}, and (Σ_Y)_{ij} = 0.5^{|i−j|}, for i, j = 1, ..., d.

We apply PPMM to estimate the OTM between p_X and p_Y from {x_i}_{i=1}^n and {y_i}_{i=1}^n. In comparison, we also consider the following two projection-based competitors: (1) the random projection method (RANDOM) proposed in [39, 40]; (2) the sliced method (SLICED) proposed in [7, 42]. The number of slices L is set to be 10, 20, and 50. We assess the convergence of each method by the estimated Wasserstein distance of order 2 after each iteration, i.e., Ŵ_2(φ^(k)(X), X), where φ^(k)(·) is the estimator of the OTM after the kth iteration. For all three methods, we set the maximum number of iterations to be 200. Notice that the Wasserstein distance between p_X and p_Y admits a closed form,

  W_2^2(p_X, p_Y) = ‖μ_X − μ_Y‖_2^2 + trace( Σ_X + Σ_Y − 2(Σ_X^{1/2} Σ_Y Σ_X^{1/2})^{1/2} ),  (2)

which serves as the ground truth. The results are presented in Figure 3.

In all three scenarios, PPMM (red line) converges to the ground truth within a small number of iterations. The fluctuations of the convergence curves observed in Figure 3 are caused by the non-equal sample means. This can be adjusted by applying a first-order dimension reduction method (e.g., SIR). We do not pursue this approach as the fluctuations do not obscure the main pattern in Figure 3.
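The closed-form expression (2) is straightforward to evaluate numerically. A small NumPy sketch (the function names are ours; the symmetric matrix square roots are computed via eigendecomposition, which is valid since the matrices involved are positive semi-definite):

```python
import numpy as np

def sqrtm_psd(A):
    """Square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)          # guard against tiny negative eigenvalues
    return V @ np.diag(np.sqrt(w)) @ V.T

def w2_gaussian(mu_x, Sigma_x, mu_y, Sigma_y):
    """Closed-form 2-Wasserstein distance between two Gaussians, as in (2)."""
    root_x = sqrtm_psd(Sigma_x)
    cross = sqrtm_psd(root_x @ Sigma_y @ root_x)
    w2_sq = np.sum((mu_x - mu_y) ** 2) + np.trace(Sigma_x + Sigma_y - 2.0 * cross)
    return np.sqrt(max(w2_sq, 0.0))

# ground truth for the simulation setting above (d = 10 case)
d = 10
mu_x, mu_y = np.full(d, -2.0), np.full(d, 2.0)
idx = np.arange(d)
Sigma_x = 0.8 ** np.abs(idx[:, None] - idx[None, :])
Sigma_y = 0.5 ** np.abs(idx[:, None] - idx[None, :])
print(w2_gaussian(mu_x, Sigma_x, mu_y, Sigma_y))
```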
When d = 10, RANDOM and SLICED converge to the ground truth but in a much slower manner. When d = 20 and 50, neither RANDOM nor SLICED manages to converge within 200 iterations. We also find that a large number of slices L does not necessarily lead to a better estimation for the SLICED method. As we can see, PPMM is the only one of the three that is adaptive to large-scale OTM estimation problems.

Figure 3: The black dashed line is the true value of the Wasserstein distance as in (2). The colored lines represent the sample mean of the estimated Wasserstein distances over 100 replications, and the vertical bars represent the standard deviations.

In Table 1 below, we compare the computational cost of the three methods by reporting the CPU time per iteration over 100 replications.^2 As expected, the RANDOM method has the lowest CPU time per iteration because it does not select a projection direction. We notice that the CPU time per iteration of the SLICED method is proportional to the number of slices L.
Last but not least, the CPU time per iteration of PPMM is slightly larger than RANDOM but much smaller than SLICED.

Table 1: The mean CPU time (sec) per iteration, with standard deviations presented in parentheses

         PPMM           RANDOM         SLICED(10)     SLICED(20)     SLICED(50)
d = 10   0.019 (0.008)  0.011 (0.008)  0.111 (0.019)  0.213 (0.024)  0.529 (0.031)
d = 20   0.027 (0.011)  0.014 (0.008)  0.125 (0.027)  0.247 (0.033)  0.605 (0.058)
d = 50   0.059 (0.036)  0.015 (0.008)  0.171 (0.037)  0.338 (0.049)  0.863 (0.117)

In Table 2 below, we report the mean convergence time over 100 replications for PPMM, RANDOM, SLICED, the refined auction algorithm (AUCTIONBF) [3], the revised simplex algorithm (REVSIM) [34], and the shortlist method (SHORTSIM) [20].^3 Table 2 shows that PPMM is the most computationally efficient method thanks to its cheap per-iteration cost and fast convergence.

Table 2: The mean convergence time (sec) for estimating the Wasserstein distance, with standard deviations presented in parentheses. The symbol "-" is inserted when the algorithm fails to converge.

         PPMM       RANDOM      SLICED(10)    AUCTIONBF     REVSIM      SHORTSIM
d = 10   0.6 (0.1)  4.8 (1.7)   23.0 (2.6)    99.7 (10.4)   42.5 (3.2)  40.2 (4.0)
d = 20   2.1 (0.3)  24.4 (3.2)  230.2 (28.4)  109.4 (12.5)  50.2 (6.6)  42.6 (5.3)
d = 50   5.5 (0.4)  -           -             125.5 (13.3)  56.5 (7.1)  46.5 (5.6)

4.2 Application to generative models

Figure 4: Illustration of the generative model using manifold learning and optimal transport.

A critical issue in generative models is the so-called mode collapse, i.e., the generated "fake" sample fails to capture some modes present in the training data [22, 45]. To address this issue, recent studies [51, 22, 28] incorporated generative models with optimal transportation theory.
As illustrated in Figure 4, one can decompose the problem of generating fake samples into two major steps: (1) manifold learning and (2) probability transformation. Step (1) aims to discover the manifold structure of the training data by mapping the training data from the original space X ⊂ R^d to a latent space Z ⊂ R^{d*} with d* ≪ d. Notice that the probability distribution of the transformed data in Z may not be convex, leading to the problem of mode collapse. Step (2) then addresses the mode collapse issue by transporting the distribution in Z to the uniform distribution U([0, 1]^{d*}). The generative model then takes a random input from U([0, 1]^{d*}) and sequentially applies the inverse transformations of step (2) and step (1) to generate the output. In practice, one may implement steps (1) and (2) using variational autoencoders (VAE) and OTM, respectively. As we can see, the estimation of the OTM plays an essential role in this framework.

In this subsection, we apply PPMM as well as RANDOM and SLICED to generative models to study two datasets: MNIST and the Google doodle dataset. For the SLICED method, we set the number of slices to be 10, 20, and 50.
For all three methods, the number of iterations is set to be 10d*. We use the squared Euclidean distance as the cost for the VAE model.

^2 The experiments are implemented on an Intel 2.6 GHz processor.
^3 AUCTIONBF, REVSIM, and SHORTSIM are implemented via the R package "transport" [46].

Table 3: The FID of the generated samples (lower is better), with standard deviations presented in parentheses

               PPMM         RANDOM       SLICED(10)   SLICED(20)   SLICED(50)
MNIST          0.17 (0.01)  2.98 (0.01)  4.62 (0.02)  3.04 (0.01)  3.12 (0.01)
Doodle (face)  0.59 (0.09)  8.78 (0.04)  5.69 (0.01)  6.01 (0.01)  5.52 (0.01)
Doodle (cat)   0.24 (0.03)  5.99 (0.01)  8.93 (0.03)  5.26 (0.01)  5.33 (0.01)
Doodle (bird)  0.36 (0.03)  7.81 (0.03)  5.44 (0.01)  5.50 (0.01)  4.98 (0.01)

MNIST. We first study the MNIST dataset, which contains 60,000 training images and 10,000 testing images of handwritten digits. We flatten each 28 × 28 image to a 784-dimensional vector and rescale the grayscale values from [0, 255] to [0, 1]. Following the method in [51], we apply a VAE to encode the data into a latent space Z of dimensionality d* = 8. Then, the OTM from the distribution in Z to U([0, 1]^8) is estimated by PPMM as well as RANDOM and SLICED.

Figure 5: Left: random samples generated by PPMM. Right: linear interpolation between random pairs of images.

First, we visually examine the fake sample generated with PPMM. In the left-hand panel of Figure 5, we display some random images generated by PPMM. The right-hand panel of Figure 5 shows that PPMM can predict the continuous shift from one digit to another. To be specific, let a, b ∈ R^784 be the samples of two digits (e.g., 3 and 9) in the testing set. Let T: X → Z be the map induced by the VAE and φ̂ the OTM estimated by PPMM. Then φ̂(T(·)) maps the sample distribution to U([0, 1]^8). We linearly interpolate between φ̂(T(a)) and φ̂(T(b)) with equal-size steps.
Then we transform the interpolated points back to the sample distribution to generate the middle columns in the right panel of Figure 5.

We use the Fréchet Inception Distance (FID) [23] to quantify the similarity between the generated fake sample and the training sample. Specifically, we first generate 1,000 random inputs from U([0, 1]^8). We then apply PPMM, RANDOM, and SLICED to this input sample, yielding the fake samples in the latent space Z. Finally, we calculate the FID between the encoded training sample in the latent space and the generated fake samples, respectively. A small value of FID indicates the generated fake sample is similar to the training sample, and vice versa. The sample mean and sample standard deviation (in parentheses) of the FID over 50 replications are presented in Table 3. Table 3 indicates PPMM significantly outperforms the other two methods in terms of estimating the OTM.

Google doodle dataset. The Google Doodle dataset^4 contains over 50 million drawings created by users with a mouse in under 20 seconds. We analyze a pre-processed version of this dataset from the quick draw Github account^5. In the dataset we use, the drawings are centered and rendered into 28 × 28 grayscale images. We flatten each 28 × 28 image to a 784-dimensional vector and rescale the grayscale values from [0, 255] to [0, 1]. In this experiment, we study the drawings from three categories: smiley face, cat, and bird. These three categories contain 161,666, 123,202, and 133,572 drawings, respectively. Within each category, we randomly split the data into a training set and a validation set of equal sample sizes.

We apply a VAE to the training set with a stopping criterion selected by the validation set. The dimension of the latent space is set to be 16.
Let a, b ∈ R^784 be two vectors in the validation set, T : X → Z be the map induced by VAE, and φ̂ be the OTM estimated by PPMM. Note that φ̂(T(·)) maps the sample distribution to U([0, 1]^16). We then linearly interpolate between φ̂(T(a)) and φ̂(T(b)) with equal-size steps. The results are presented in Figure 6.

4https://quickdraw.withgoogle.com/data
5https://github.com/googlecreativelab/quickdraw-dataset

Figure 6: Linear interpolation between random pairs of images from the dataset of smiley face (left), cat (center), and bird (right).

Then, we quantify the similarity between the generated fake samples and the truth by calculating the FID in the latent space. The sample mean and sample standard deviation (in parentheses) of FID over 50 replications are presented in Table 3. Again, the results in Table 3 justify the superior performance of PPMM over existing projection-based methods.

5 Extensions

First, PPMM can be extended to address potential heterogeneity in the dataset by assigning non-equal weights to the points in the source and target samples. This is equivalent to calculating weighted variance-covariance matrices in Step 2 of Algorithm 1. Second, the PPMM method can be modified to allow the sizes of the source and target samples to differ. In such a scenario, we can replace the look-up table in Step (b) of Algorithm 2 with an approximate look-up table. Recall that the one-dimensional look-up table is just sorting; an approximate one-dimensional look-up table can therefore be obtained by combining sorting and linear interpolation. We validate the above extensions with a simulated experiment similar to the one in Section 4.1, except that we draw 5,000 and 1,000 points from pX and pY, respectively. We set d = 10 and assign weights to the observations randomly. The estimation results are presented in Figure 7.
In addition, the average convergence times are: PPMM (0.3s), RANDOM (1.4s), SLICED(10) (14s), and SHORTSIM (74s).

Figure 7: Experiment for heterogeneous data with non-equal sample sizes. The black dashed line is the oracle calculated by SHORTSIM.

Theorem 3 suggests that, for the PPMM algorithm, the number of iterations until convergence, i.e., K, is on the order of the dimensionality d. Here we use a simulated example to assess whether this order is attainable. We follow a setting similar to that in Section 4.1, except that we increase d from 10 to 100 with a step size of 10. In addition, we set the termination criterion to a hard threshold, i.e., 10^-5. In Figure 8, we report the sample mean (solid line) and standard deviation (vertical bars) of K over 100 replications with respect to the increasing d. One can observe a clear linear pattern.

Figure 8: Number of iterations to converge.

Acknowledgment

We would like to thank Xiaoxiao Sun, Rui Xie, Xinlian Zhang, Yiwen Liu, and Xing Xin for many fruitful discussions. We would also like to thank Dr. Xianfeng David Gu for his insightful blog about optimal transport theory. Also, we would like to thank the UC Irvine Machine Learning Repository for dataset assistance. This work was partially supported by National Science Foundation grants DMS-1440037, DMS-1440038, DMS-1438957, and NIH grants R01GM113242, R01GM122080.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[2] J.-D. Benamou, Y. Brenier, and K. Guittet. The Monge–Kantorovitch mass transfer and its computational fluid mechanics formulation. International Journal for Numerical Methods in Fluids, 40(1-2):21–30, 2002.

[3] D. P. Bertsekas.
Auction algorithms for network flow problems: A tutorial introduction. Computational Optimization and Applications, 1(1):7–66, 1992.

[4] M. Blaauw and J. Bonada. Modeling and transforming speech using variational autoencoders. In Interspeech, pages 1770–1774, 2016.

[5] E. Boissard et al. Simple bounds for the convergence of empirical and occupation measures in 1-Wasserstein distance. Electronic Journal of Probability, 16:2296–2333, 2011.

[6] E. Boissard and T. Le Gouic. On the mean speed of convergence of empirical and occupation measures in Wasserstein distance. In Annales de l'IHP Probabilités et Statistiques, volume 50, pages 539–563, 2014.

[7] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

[8] Y. Brenier. A homogenized model for vortex sheets. Archive for Rational Mechanics and Analysis, 138(4):319–353, 1997.

[9] R. D. Cook and S. Weisberg. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.

[10] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2017.

[11] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.

[12] A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pages 658–666, 2016.

[13] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan.
Neural audio synthesis of musical notes with WaveNet autoencoders. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1068–1077. JMLR.org, 2017.

[14] S. Ferradans, N. Papadakis, G. Peyré, and J.-F. Aujol. Regularized discrete optimal transport. SIAM Journal on Imaging Sciences, 7(3):1853–1882, 2014.

[15] J. H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.

[16] J. H. Friedman and W. Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.

[17] A. Genevay, M. Cuturi, G. Peyré, and F. Bach. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440–3448, 2016.

[18] S. Gerber and M. Maggioni. Multiscale strategies for computing optimal transport. The Journal of Machine Learning Research, 18(1):2440–2471, 2017.

[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[20] C. Gottschlich and D. Schuhmacher. The shortlist method for fast computation of the earth mover's distance and finding optimal solutions to transportation problems. PLoS ONE, 9(10):e110214, 2014.

[21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[22] Y. Guo, D. An, X. Qi, Z. Luo, S.-T. Yau, X. Gu, et al. Mode collapse and regularity of optimal transportation maps. arXiv preprint arXiv:1902.02934, 2019.

[23] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[24] P. J. Huber. Projection pursuit. The Annals of Statistics, pages 435–475, 1985.

[25] J.-C. Hütter and P. Rigollet. Minimax rates of estimation for smooth optimal transport maps. arXiv preprint arXiv:1905.05828, 2019.

[26] A. Ifarraguerri and C.-I. Chang. Unsupervised hyperspectral image analysis with projection pursuit. IEEE Transactions on Geoscience and Remote Sensing, 38(6):2529–2538, 2000.

[27] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[28] S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced-Wasserstein autoencoder: An embarrassingly simple generative model. arXiv preprint arXiv:1804.01947, 2018.

[29] B. Li and S. Wang. On directional regression for dimension reduction. Journal of the American Statistical Association, 102(479):997–1008, 2007.

[30] K.-C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.

[31] K.-C. Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.

[32] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion GAN for future-flow embedded video prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1744–1752, 2017.

[33] Y. Liu, Z. Qin, Z. Luo, and H. Wang. Auto-painter: Cartoon image generation from sketch by using conditional generative adversarial networks. arXiv preprint arXiv:1705.01908, 2017.

[34] D. G. Luenberger, Y. Ye, et al. Linear and Nonlinear Programming, volume 2. Springer, 1984.

[35] Q. Mérigot. A multiscale approach to optimal transport. In Computer Graphics Forum, volume 30, pages 1583–1592. Wiley Online Library, 2011.

[36] N. Papadakis, G. Peyré, and E. Oudet. Optimal transport with proximal splitting. SIAM Journal on Imaging Sciences, 7(1):212–238, 2014.

[37] O. Pele and M. Werman. Fast and robust earth mover's distances. In 2009 IEEE 12th International Conference on Computer Vision, pages 460–467. IEEE, 2009.

[38] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.

[39] F. Pitie, A. C. Kokaram, and R. Dahyot. N-dimensional probability density function transfer and its application to color transfer. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 2, pages 1434–1439. IEEE, 2005.

[40] F. Pitié, A. C. Kokaram, and R. Dahyot. Automated colour grading using colour distribution transfer. Computer Vision and Image Understanding, 107(1-2):123–137, 2007.

[41] J. Rabin, S. Ferradans, and N. Papadakis. Adaptive color transfer with relaxed optimal transport. In 2014 IEEE International Conference on Image Processing (ICIP), pages 4852–4856. IEEE, 2014.

[42] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.

[43] S. Reich. A nonparametric ensemble transform method for Bayesian inference. SIAM Journal on Scientific Computing, 35(4):A2013–A2024, 2013.

[44] Y. Rubner, L. J. Guibas, and C. Tomasi. The earth mover's distance, multi-dimensional scaling, and color-based image retrieval. In Proceedings of the ARPA Image Understanding Workshop, volume 661, page 668, 1997.

[45] T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. arXiv preprint arXiv:1803.05573, 2018.

[46] D.
Schuhmacher, B. Bähre, C. Gottschlich, V. Hartmann, F. Heinemann, and B. Schmitzer. Transport: Computation of Optimal Transport Plans and Wasserstein Distances, 2019. R package version 0.12-1.

[47] V. Seguy, B. B. Damodaran, R. Flamary, N. Courty, A. Rolet, and M. Blondel. Large-scale optimal transport and mapping estimation. arXiv preprint arXiv:1711.02283, 2017.

[48] J. Solomon, R. Rustamov, L. Guibas, and A. Butscher. Earth mover's distances on discrete surfaces. ACM Transactions on Graphics (TOG), 33(4):67, 2014.

[49] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, G. R. Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

[50] Z. Su, Y. Wang, R. Shi, W. Zeng, J. Sun, F. Luo, and X. Gu. Optimal mass transport for shape matching and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11):2246–2259, 2015.

[51] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

[52] C. Villani. Optimal Transport: Old and New. Springer Science & Business Media, 2008.

[53] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621, 2016.

[54] J. Weed and F. Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance.
arXiv preprint arXiv:1707.00087, 2017.