{"title": "Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem", "book": "Advances in Neural Information Processing Systems", "page_first": 4541, "page_last": 4551, "abstract": "We prove several fundamental statistical bounds for entropic OT with the squared Euclidean cost between subgaussian probability measures in arbitrary dimension.\nFirst, through a new sample complexity result we establish the rate of convergence of entropic OT for empirical measures.\nOur analysis improves exponentially on the bound of Genevay et al.~(2019) and extends their work to unbounded measures.\nSecond, we establish a central limit theorem for entropic OT, based on techniques developed by Del Barrio and Loubes~(2019).\nPreviously, such a result was only known for finite metric spaces.\nAs an application of our results, we develop and analyze a new technique for estimating the entropy of a random variable corrupted by gaussian noise.", "full_text": "Statistical bounds for entropic optimal transport:\nsample complexity and the central limit theorem\n\nGonzalo Mena\n\nHarvard\n\nJonathan Niles-Weed\n\nNYU\n\nAbstract\n\nWe prove several fundamental statistical bounds for entropic OT with the squared\nEuclidean cost between subgaussian probability measures in arbitrary dimension.\nFirst, through a new sample complexity result we establish the rate of convergence\nof entropic OT for empirical measures. Our analysis improves exponentially on\nthe bound of Genevay et al. (2019) and extends their work to unbounded measures.\nSecond, we establish a central limit theorem for entropic OT, based on techniques\ndeveloped by Del Barrio and Loubes (2019). Previously, such a result was only\nknown for \ufb01nite metric spaces. 
As an application of our results, we develop and analyze a new technique for estimating the entropy of a random variable corrupted by Gaussian noise.

1 Introduction

Optimal transport is an increasingly popular tool for the analysis of large data sets in high dimension, with applications in domain adaptation (Courty et al., 2014, 2017), image recognition (Li et al., 2013; Rubner et al., 2000; Sandler and Lindenbaum, 2011), and word embedding (Alvarez-Melis and Jaakkola, 2018; Grave et al., 2018). Its flexibility and simplicity have made it an attractive choice for practitioners and theorists alike, and its ubiquity as a machine learning tool continues to grow (see, e.g., Peyré et al., 2019; Kolouri et al., 2017, for surveys).

Much of the recent interest in optimal transport has been driven by algorithmic advances, chief among them the popularization of entropic regularization as a tool for solving large-scale OT problems quickly (Cuturi, 2013). Not only has this proposal been shown to yield near-linear-time algorithms for the original optimal transport problem (Altschuler et al., 2017), but it also appears to possess useful statistical properties which make it an attractive choice for machine learning applications (Rigollet and Weed, 2018; Genevay et al., 2017; Schiebinger et al., 2019; Montavon et al., 2016). For instance, in a recent breakthrough work, Genevay et al. (2019) established that even though the empirical version of standard OT suffers from the "curse of dimensionality" (see, e.g., Dudley, 1969), the empirical version of entropic OT always converges at the parametric $1/\sqrt{n}$ rate for compactly supported probability measures. This result suggests that entropic OT may be significantly more useful than unregularized OT for inference tasks when the dimension is large.
However, obtaining rigorous guarantees for the performance of entropic OT in practice requires a more thorough understanding of its statistical behavior.

1.1 Summary of contributions

We prove new results on the relation between the population and empirical versions of the entropic cost, that is, between $S(P, Q)$ and $S(P_n, Q_n)$ (defined in Section 1.2, below). These results give the first characterization of the large-sample behavior of entropic OT for unbounded probability measures in arbitrary dimension. Specifically, we obtain: (i) New sample complexity bounds on $\mathbb{E}|S(P, Q) - S(P_n, Q_n)|$: first, we improve on the results of Genevay et al. (2019) by an exponential factor, and then extend these to unbounded measures (Section 2). (ii) A central limit theorem characterizing the fluctuations $S(P_n, Q_n) - \mathbb{E}\,S(P_n, Q_n)$ when $P$ and $Q$ are subgaussian (Section 3). Such a central limit theorem was previously only known for probability measures supported on a finite number of points (Bigot et al., 2017; Klatt et al., 2018). We use completely different techniques, inspired by recent work of Del Barrio and Loubes (2019), to prove our theorem for general subgaussian distributions.

As an application of our results, we show how entropic OT can be used to shed new light on the entropy estimation problem for random variables corrupted by subgaussian noise (Section 4). This problem has gained recent interest in machine learning (Goldfeld et al., 2018, 2019) as a tool for obtaining a theoretically sound understanding of the Information Bottleneck Principle in deep learning (Tishby and Zaslavsky, 2015).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
We design and analyze a new estimator for this problem based on entropic OT. Finally, we provide simulations which give empirical validation for our theoretical claims (Section 5).

1.2 Background and preliminaries

Let $P, Q \in \mathcal{P}(\mathbb{R}^d)$ be two probability measures and let $P_n$ and $Q_n$ be the empirical measures from the independent samples $\{X_i\}_{i \le n} \sim P^n$ and $\{Y_i\}_{i \le n} \sim Q^n$. We define the squared Wasserstein distance between $P$ and $Q$ (Villani, 2008) as follows:

\[
W_2^2(P, Q) := \inf_{\pi \in \Pi(P,Q)} \left[ \int_{\mathcal{X} \times \mathcal{Y}} \tfrac{1}{2}\|x - y\|^2 \, d\pi(x, y) \right], \tag{1}
\]

where $\Pi(P, Q)$ is the set of all joint distributions with marginals equal to $P$ and $Q$, respectively. We focus on an entropy-regularized version of the above cost (Cuturi, 2013; Peyré et al., 2019), defined as

\[
S_\epsilon(P, Q) := \inf_{\pi \in \Pi(P,Q)} \left[ \int_{\mathcal{X} \times \mathcal{Y}} \tfrac{1}{2}\|x - y\|^2 \, d\pi(x, y) + \epsilon H(\pi \,|\, P \otimes Q) \right], \tag{2}
\]

where $H(\alpha|\beta)$ denotes the relative entropy between probability measures $\alpha$ and $\beta$, defined by $\int \log \frac{d\alpha}{d\beta}(x)\, d\alpha(x)$ if $\alpha \ll \beta$ and $+\infty$ otherwise. By rescaling the measures $P$ and $Q$ and the regularization parameter $\epsilon$, it suffices to analyze the case $\epsilon = 1$, which we denote by $S(P, Q)$. Note that we have considered the squared cost $\frac{1}{2}\|\cdot\|^2$ in the definition of $S_\epsilon(P, Q)$, since most of our bounds depend heavily on this cost. However, more general costs $c(x, y)$ may be considered, and indeed some of our results (e.g. Proposition 4) are stated for more general $c(x, y)$.
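On empirical measures, the infimum in (2) is a finite-dimensional problem over coupling matrices, and it is standard to solve its dual with Sinkhorn iterations. The following sketch is our own illustrative code, not code from the paper: a log-domain Sinkhorn solver for $S(P_n, Q_m)$ with the quadratic cost and $\epsilon = 1$, returning the dual value together with the potentials evaluated at the sample points.

```python
import numpy as np
from scipy.special import logsumexp

def entropic_ot(X, Y, n_iter=200):
    """Log-domain Sinkhorn for S(P_n, Q_m) with cost 0.5 * ||x - y||^2,
    eps = 1, and uniform weights 1/n and 1/m on the sample points."""
    n, m = len(X), len(Y)
    C = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # n x m cost matrix
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        # each update enforces one of the two marginal optimality
        # conditions on the potentials, exactly on the sample points
        f = -logsumexp(g[None, :] - C - np.log(m), axis=1)
        g = -logsumexp(f[:, None] - C - np.log(n), axis=0)
    # at a fixed point the exponential term in the dual integrates to one,
    # so the dual value reduces to the average of the two potentials
    return f.mean() + g.mean(), f, g
```

The returned value is a valid dual lower bound at every iteration and converges to $S(P_n, Q_m)$ as the iterations proceed; a quick sanity check is that two single-point samples at squared distance $\|x-y\|^2$ give exactly the transport cost $\tfrac12\|x-y\|^2$, since the coupling is forced and the entropy penalty vanishes.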
We leave a full analysis of the general case to future work.

The general theory of entropic OT (Csiszár, 1975) implies that $S(P, Q)$ possesses a dual formulation:

\[
S(P, Q) = \sup_{f \in L^1(P),\, g \in L^1(Q)} \int f(x)\, dP(x) + \int g(y)\, dQ(y) - \int e^{f(x)+g(y)-\frac{1}{2}\|x-y\|^2} dP(x)\, dQ(y) + 1, \tag{3}
\]

and that as long as $P$ and $Q$ have finite second moments, the supremum is attained at a pair of optimal potentials $(f, g)$ satisfying

\[
\int e^{f(x)+g(y)-\frac{1}{2}\|x-y\|^2} dQ(y) = 1 \quad P\text{-a.s.}, \qquad \int e^{f(x)+g(y)-\frac{1}{2}\|x-y\|^2} dP(x) = 1 \quad Q\text{-a.s.} \tag{4}
\]

Conversely, any $f \in L^1(P)$, $g \in L^1(Q)$ satisfying (4) are optimal potentials.

We focus throughout on subgaussian probability measures. We say that a distribution $P \in \mathcal{P}(\mathbb{R}^d)$ is $\sigma^2$-subgaussian for $\sigma \ge 0$ if $\mathbb{E}_P e^{\|X\|^2/2d\sigma^2} \le 2$. By Jensen's inequality, if $\mathbb{E}_P e^{\|X\|^2/2d\sigma^2} \le C$ for any constant $C \ge 2$, then $P$ is $C\sigma^2$-subgaussian. Note that if $P$ is subgaussian, then $\mathbb{E}_P e^{v^\top X} < \infty$ for all $v \in \mathbb{R}^d$. Conversely, standard results (see, e.g., Vershynin, 2018) imply that our definition is satisfied if $\mathbb{E}_P e^{u^\top X} \le e^{\|u\|^2 \sigma^2/2}$ for all $u \in \mathbb{R}^d$.

2 Sample complexity for the entropic transportation cost for general subgaussian measures

One rigorous statistical benefit of entropic OT is its sample complexity, i.e., the minimum number of samples required for the empirical entropic OT cost $S(P_n, Q_n)$ to be an accurate estimate of $S(P, Q)$. As noted above, unregularized OT suffers from the curse of dimensionality: in general, the Wasserstein distance $W_2^2(P_n, Q_n)$ converges to $W_2^2(P, Q)$ no faster than $n^{-1/d}$ for measures in $\mathbb{R}^d$. Strikingly, Genevay et al. (2019) established that the statistical performance of the entropic OT cost is significantly better.
They show:¹

Theorem 1 (Genevay et al., 2019, Theorem 3). Let $P$ and $Q$ be two probability measures on a bounded domain in $\mathbb{R}^d$ of diameter $D$. Then

\[
\sup_{P,Q} \mathbb{E}_{P,Q} |S_\epsilon(P, Q) - S_\epsilon(P_n, Q_n)| \le K_{D,d} \left(1 + \frac{1}{\epsilon^{\lfloor d/2 \rfloor}}\right) \frac{e^{D^2/\epsilon}}{\sqrt{n}}, \tag{5}
\]

where $K_{D,d}$ is a constant depending on $D$ and $d$.

This impressive result offers powerful evidence that entropic OT converges significantly faster than its unregularized counterpart. The drawbacks of this result are that it applies only to bounded measures, and, perhaps more critically in applications, the rate scales exponentially in $D$ and $1/\epsilon$, even in dimension 1. Therefore, while the qualitative message of Theorem 1 is clear, it does not offer useful quantitative bounds as soon as the measure is unbounded or lies in a set of large diameter.

Our first theorem is a significant sharpening of Theorem 1. We first state it for the case where $\epsilon = 1$.

Theorem 2. If $P$ and $Q$ are $\sigma^2$-subgaussian, then

\[
\mathbb{E}_{P,Q} |S(P, Q) - S(P_n, Q_n)| \le K_d \left(1 + \sigma^{\lceil 5d/2 \rceil + 6}\right) \frac{1}{\sqrt{n}}. \tag{6}
\]

If we denote by $P^\epsilon$ and $Q^\epsilon$ the pushforwards of $P$ and $Q$ under the map $x \mapsto \epsilon^{-1/2} x$, then it is easy to see that

\[
S_\epsilon(P, Q) = \epsilon\, S(P^\epsilon, Q^\epsilon).
\]

We immediately obtain the following corollary.

Corollary 1. If $P$ and $Q$ are $\sigma^2$-subgaussian, then

\[
\mathbb{E}_{P,Q} |S_\epsilon(P, Q) - S_\epsilon(P_n, Q_n)| \le K_d \cdot \epsilon \left(1 + \frac{\sigma^{\lceil 5d/2 \rceil + 6}}{\epsilon^{\lceil 5d/4 \rceil + 3}}\right) \frac{1}{\sqrt{n}}.
\]

If we compare Corollary 1 with Theorem 1, we note that the polynomial prefactor in Corollary 1 has higher degree than the one in Theorem 1, pointing to a potential weakness of our bound. On the other hand, the exponential dependence on $D^2/\epsilon$ has completely disappeared.
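The rescaling identity $S_\epsilon(P, Q) = \epsilon\, S(P^\epsilon, Q^\epsilon)$ behind Corollary 1 also holds verbatim for empirical measures, which gives a quick numerical sanity check. The solver below is an illustrative log-domain Sinkhorn written for this sketch, not code from the paper:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_cost(X, Y, eps=1.0, n_iter=200):
    # dual value of S_eps(P_n, Q_m) for cost 0.5 * ||x - y||^2, uniform weights
    n, m = len(X), len(Y)
    C = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        f = -eps * logsumexp((g[None, :] - C) / eps - np.log(m), axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps - np.log(n), axis=0)
    return f.mean() + g.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
Y = rng.normal(size=(50, 2)) + 1.0
eps = 0.5
lhs = sinkhorn_cost(X, Y, eps=eps)
# pushforward x -> x / sqrt(eps), then compare with eps * S(P^eps, Q^eps)
rhs = eps * sinkhorn_cost(X / np.sqrt(eps), Y / np.sqrt(eps), eps=1.0)
```

The two runs produce identical iterates up to floating-point error: scaling the points divides the cost matrix by $\epsilon$, so the updates of the scaled problem are exactly the original updates divided by $\epsilon$.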
Moreover, the brittle quantity $D$, finite only for compactly supported measures, has been replaced by the more flexible subgaussian variance proxy $\sigma^2$.

The improvements in Theorem 2 are obtained via two different methods. First, a simple argument allows us to remove the exponential term and bound the desired quantity by an empirical process, as in Genevay et al. (2019). Much more challenging is the extension to measures with unbounded support. The proof technique of Genevay et al. (2019) relies on establishing uniform bounds on the derivatives of the optimal potentials, but this strategy cannot succeed if the support of $P$ and $Q$ is not compact. We therefore employ a more careful argument based on controlling the Hölder norms of the optimal potentials on compact sets. A chaining bound completes our proof.

In Proposition 1 below (whose proof we defer to the supplement) we show that if $(f, g)$ is a pair of optimal potentials for $\sigma^2$-subgaussian distributions $P$ and $Q$, then we may control the size of $f$ and its derivatives.

Proposition 1. Let $P$ and $Q$ be $\sigma^2$-subgaussian distributions. There exist optimal dual potentials $(f, g)$ for $P$ and $Q$ such that for any multi-index $\alpha$ with $|\alpha| = k$,

\[
|D^\alpha (f - \tfrac{1}{2}\|\cdot\|^2)(x)| \le C_{k,d} \begin{cases} 1 + \sigma^4 & k = 0 \\ \sigma^k (\sigma + \sigma^2)^k & \text{otherwise,} \end{cases} \tag{7}
\]

if $\|x\| \le \sqrt{d}\,\sigma$, and

\[
|D^\alpha (f - \tfrac{1}{2}\|\cdot\|^2)(x)| \le C_{k,d} \begin{cases} 1 + (1 + \sigma^2)\|x\|^2 & k = 0 \\ \sigma^k \left(\sqrt{\sigma \|x\|} + \sigma \|x\|\right)^k & \text{otherwise,} \end{cases} \tag{8}
\]

if $\|x\| > \sqrt{d}\,\sigma$, where $C_{k,d}$ is a constant depending only on $k$ and $d$.

¹We have specialized their result to the squared Euclidean cost.

We denote by $\mathcal{F}_\sigma$ the set of functions satisfying the bounds (7) and (8).
The following proposition shows that it suffices to control an empirical process indexed by this set.

Proposition 2. Let $P$, $Q$, and $P_n$ be $\tilde\sigma^2$-subgaussian distributions, for a possibly random $\tilde\sigma \in [0, \infty)$. Then

\[
|S(P_n, Q) - S(P, Q)| \le 2 \sup_{u \in \mathcal{F}_{\tilde\sigma}} |\mathbb{E}_P u - \mathbb{E}_{P_n} u|. \tag{9}
\]

Proof. We define the operator $A_{\alpha,\beta}(u, v)$ for the pair of probability measures $(\alpha, \beta)$ and functions $(u, v) \in L^1(\alpha) \otimes L^1(\beta)$ as:

\[
A_{\alpha,\beta}(u, v) = \int u(x)\, d\alpha(x) + \int v(y)\, d\beta(y) - \int\!\!\int e^{u(x)+v(y)-\frac{1}{2}\|x-y\|^2} d\alpha(x)\, d\beta(y) + 1.
\]

Denote by $(f_n, g_n)$ a pair of optimal potentials for $(P_n, Q)$ and $(f, g)$ for $(P, Q)$, respectively. By Proposition A.1 in the supplement, we can choose smooth optimal potentials $(f, g)$ and $(f_n, g_n)$ so that the condition (4) holds for all $x, y \in \mathbb{R}^d$. Proposition 1 shows that $f, f_n \in \mathcal{F}_{\tilde\sigma}$.

Strong duality implies that $S(P, Q) = A_{P,Q}(f, g)$ and $S(P_n, Q) = A_{P_n,Q}(f_n, g_n)$. Moreover, by the optimality of $(f, g)$ and $(f_n, g_n)$ for their respective dual problems, we obtain

\[
A_{P,Q}(f_n, g_n) - A_{P_n,Q}(f_n, g_n) \le A_{P,Q}(f, g) - A_{P_n,Q}(f_n, g_n) \le A_{P,Q}(f, g) - A_{P_n,Q}(f, g).
\]

We conclude that

\[
|S(P, Q) - S(P_n, Q)| = |A_{P,Q}(f, g) - A_{P_n,Q}(f_n, g_n)| \le |A_{P,Q}(f, g) - A_{P_n,Q}(f, g)| + |A_{P,Q}(f_n, g_n) - A_{P_n,Q}(f_n, g_n)|.
\]

It therefore suffices to bound the differences $|A_{P,Q}(f, g) - A_{P_n,Q}(f, g)|$ and $|A_{P,Q}(f_n, g_n) - A_{P_n,Q}(f_n, g_n)|$.

Upon defining $h(x) := \int e^{g(y) - \frac{1}{2}\|x-y\|^2} dQ(y)$, we have

\[
A_{P,Q}(f, g) - A_{P_n,Q}(f, g) = \left( \int f(x)\,(dP(x) - dP_n(x)) \right) - \left( \int e^{f(x)} h(x)\, (dP(x) - dP_n(x)) \right).
\]

Since $(f, g)$ satisfy $e^{f(x)} h(x) = 1$ for all $x \in \mathbb{R}^d$, the second term above vanishes.
Therefore

\[
|A_{P,Q}(f, g) - A_{P_n,Q}(f, g)| = \left| \int f(x)\,(dP(x) - dP_n(x)) \right| \le \sup_{u \in \mathcal{F}_{\tilde\sigma}} \left| \int u(x)\,(dP(x) - dP_n(x)) \right|.
\]

Analogously,

\[
|A_{P,Q}(f_n, g_n) - A_{P_n,Q}(f_n, g_n)| \le \sup_{u \in \mathcal{F}_{\tilde\sigma}} \left| \int u(x)\,(dP(x) - dP_n(x)) \right|.
\]

This proves the claim.

Proposition 2 can be extended to apply to simultaneously varying $P_n$ and $Q_n$.

Corollary 2. Let $P$, $Q$, $P_n$, and $Q_n$ be $\tilde\sigma^2$-subgaussian distributions, where $\tilde\sigma \in [0, \infty)$ is possibly random. Then

\[
|S(P_n, Q_n) - S(P, Q)| \lesssim \sup_{u \in \mathcal{F}_{\tilde\sigma}} \left| \int u(x)\,(dP(x) - dP_n(x)) \right| + \sup_{u \in \mathcal{F}_{\tilde\sigma}} \left| \int u(x)\,(dQ(x) - dQ_n(x)) \right|
\]

almost surely.

Proof. By the triangle inequality,

\[
|S(P_n, Q_n) - S(P, Q)| \le |S(P, Q) - S(P_n, Q)| + |S(P_n, Q) - S(P_n, Q_n)|. \tag{10}
\]

Since $P$, $Q$, $P_n$, and $Q_n$ are all $\tilde\sigma^2$-subgaussian, Proposition 2 can be applied to both terms.

The majority of our work goes into bounding the resulting empirical process. Let $s \ge 2$.
Fix a constant $C_{s,d}$ and denote by $\mathcal{F}^s$ the set of functions satisfying

\[
|f(x)| \le C_{s,d}(1 + \|x\|^2), \tag{11}
\]
\[
|D^\alpha f(x)| \le C_{s,d}(1 + \|x\|^s) \quad \forall \alpha : |\alpha| \le s. \tag{12}
\]

Proposition 1 establishes that if $C_{s,d}$ is large enough, then $\frac{1}{1 + \sigma^{3s}} f \in \mathcal{F}^s$ for all $f \in \mathcal{F}_\sigma$.

The key result is the following covering number bound, whose proof we defer to the supplement. Denote by $N(\varepsilon, \mathcal{F}^s, L^2(P_n))$ the covering number with respect to the (random) metric $L^2(P_n)$ defined by $\|f\|_{L^2(P_n)} = \left( \frac{1}{n} \sum_{i=1}^n f(X_i)^2 \right)^{1/2}$.

Proposition 3. Let $s = \lceil d/2 \rceil + 1$. If $P$ is $\sigma^2$-subgaussian and $P_n$ is an empirical distribution, then there exists a random variable $L$ depending on the sample $X_1, \dots, X_n$ satisfying $\mathbb{E} L \le 2$ such that

\[
\log N(\varepsilon, \mathcal{F}^s, L^2(P_n)) \le C_d L^{d/2s} \varepsilon^{-d/s} (1 + \sigma^{2d}),
\]

and

\[
\max_{f \in \mathcal{F}^s} \|f\|^2_{L^2(P_n)} \le C_d (1 + L\sigma^4).
\]

We can now prove Theorem 2.

Proof of Theorem 2. Let $\tilde\sigma$ be the infimum over all $\tau > 0$ such that $P$, $Q$, $P_n$, and $Q_n$ are all $\tau^2$-subgaussian. By Lemma A.2 in the supplement, $\tilde\sigma$ is finite almost surely. By Corollary 2,

\[
\mathbb{E}_{P,Q}|S(P, Q) - S(P_n, Q_n)| \lesssim \mathbb{E} \sup_{u \in \mathcal{F}_{\tilde\sigma}} \left| \int u(x)\,(dP(x) - dP_n(x)) \right| + \mathbb{E} \sup_{u \in \mathcal{F}_{\tilde\sigma}} \left| \int u(x)\,(dQ(x) - dQ_n(x)) \right|.
\]

We will show how to bound the first term, and the second will follow in exactly the same way. For any set of functions $\mathcal{F}$, we write $\|P - P_n\|_{\mathcal{F}} = \sup_{u \in \mathcal{F}} \left( \int u(x)\,(dP(x) - dP_n(x)) \right)$. Recall that, for $s = \lceil d/2 \rceil + 1$, if $u \in \mathcal{F}_{\tilde\sigma}$ then $\frac{1}{1 + \tilde\sigma^{3s}} u \in \mathcal{F}^s$.
Therefore, by the Cauchy-Schwarz inequality,

\[
\mathbb{E}\|P - P_n\|_{\mathcal{F}_{\tilde\sigma}} \le \mathbb{E}(1 + \tilde\sigma^{3s})\|P - P_n\|_{\mathcal{F}^s} \le \left( \mathbb{E}(1 + \tilde\sigma^{3s})^2 \right)^{1/2} \left( \mathbb{E}\|P - P_n\|^2_{\mathcal{F}^s} \right)^{1/2}.
\]

Then by Giné and Nickl (2016, Theorem 3.5.1 and Exercise 2.3.1), we have

\[
\begin{aligned}
\mathbb{E}\|P - P_n\|^2_{\mathcal{F}^s} &\lesssim \frac{1}{n}\, \mathbb{E} \left( \int_0^{\sqrt{\max_{f \in \mathcal{F}^s} \|f\|^2_{L^2(P_n)}}} \sqrt{\log 2N(\tau, \mathcal{F}^s, L^2(P_n))}\, d\tau \right)^2 \\
&\le C_d \frac{1}{n}\, \mathbb{E} \left( \int_0^{C_d \sqrt{1 + L\sigma^4}} \sqrt{1 + L^{d/2s} \tau^{-d/s} (1 + \sigma^{2d})}\, d\tau \right)^2 \\
&\le C_d \frac{1}{n} (1 + \sigma^{2d})\, \mathbb{E} \left( \int_0^{C_d \sqrt{1 + L\sigma^4}} L^{d/4s} \tau^{-d/2s}\, d\tau \right)^2 \\
&\le C_d \frac{1}{n} (1 + \sigma^{2d})\, \mathbb{E} \left[ (1 + L\sigma^4)^{1 - d/2s} \right],
\end{aligned}
\]

where in the last step we have used that $d/2s < 1$, so that $\tau^{-d/2s}$ is integrable in a neighborhood of the origin. Applying the bound on $\mathbb{E} L$ yields that this expression is bounded by $C_d(1 + \sigma^{2d+4})\frac{1}{n}$.

Lemma A.4 in the supplement shows that $\mathbb{E}\tilde\sigma^{2k} \le C_k \sigma^{2k}$ for all positive integers $k$. Combining these bounds yields

\[
\mathbb{E}\|P - P_n\|_{\mathcal{F}_{\tilde\sigma}} \le C_d (1 + \sigma^{3s})(1 + \sigma^{d+2}) \frac{1}{\sqrt{n}},
\]

as desired.

3 A central limit theorem for entropic OT

The results of Section 2 show that, for general subgaussian measures, the empirical quantity $S(P_n, Q_n)$ converges to $S(P, Q)$ in expectation at the parametric rate. However, in order to use entropic OT for rigorous statistical inference tasks, much finer control over the deviations of $S(P_n, Q_n)$ is needed, for instance in the form of asymptotic distributional limits.
In this section, we accomplish this goal by showing a central limit theorem (CLT) for $S(P_n, Q_n)$, valid for any subgaussian measures.

Bigot et al. (2017) and Klatt et al. (2018) have shown CLTs for entropic OT when the measures lie in a finite metric space (or, equivalently, when $P$ and $Q$ are finitely supported). Apart from being restrictive in practice, these results do not shed much light on the general situation because OT on finite metric spaces behaves quite differently from OT on $\mathbb{R}^d$.² Very recently, distributional limits for general measures possessing $4 + \delta$ moments have been obtained for unregularized OT by Del Barrio and Loubes (2019). Our proof follows their approach.

We prove the following.

Theorem 3. Let $X_1, \dots, X_n \sim P$ be an i.i.d. sequence, and denote by $(f, g)$ the optimal potentials in (4). If $P$ is subgaussian, then

\[
\sqrt{n}\, \big(S(P_n, Q) - \mathbb{E}\,S(P_n, Q)\big) \xrightarrow{D} \mathcal{N}\big(0, \mathrm{Var}_P(f(X))\big), \tag{13}
\]

and

\[
\lim_{n \to \infty} n\, \mathrm{Var}(S(P_n, Q)) = \mathrm{Var}_P(f(X)). \tag{14}
\]

Likewise, let $X_1, \dots, X_n \sim P$ and $Y_1, \dots, Y_m \sim Q$ be two i.i.d. sequences independent of each other, and assume $P$ and $Q$ are both subgaussian. Denote $\lambda := \lim_{m,n \to \infty} \frac{n}{m+n} \in (0, 1)$. Then

\[
\sqrt{\frac{mn}{m+n}}\, \big(S(P_n, Q_m) - \mathbb{E}\,S(P_n, Q_m)\big) \xrightarrow{D} \mathcal{N}\big(0, (1 - \lambda)\,\mathrm{Var}_P(f(X_1)) + \lambda\, \mathrm{Var}_Q(g(Y_1))\big), \tag{15}
\]

and

\[
\lim_{m,n \to \infty} \frac{mn}{m+n}\, \mathrm{Var}(S(P_n, Q_m)) = (1 - \lambda)\,\mathrm{Var}_P(f(X)) + \lambda\, \mathrm{Var}_Q(g(Y)). \tag{16}
\]

The proof is deeply inspired by the method developed in Del Barrio and Loubes (2019) for the squared Wasserstein distance, and we roughly follow the same strategy.

Proof of Theorem 3.
The proof, in the one-sample case, proceeds as follows:

(a) In Proposition A.2 we show that the optimal potentials for $(P_n, Q)$ converge to optimal potentials for $(P, Q)$ uniformly on compact sets.

(b) Letting $R_n := S(P_n, Q) - \int f(x)\, dP_n(x)$, we show in Proposition A.3 in the supplement that this uniform convergence implies that $\lim_{n \to \infty} n\, \mathrm{Var}(R_n) = 0$.

(c) The above convergence indicates that $S(P_n, Q)$ can be approximated by the linear quantity $\int f(x)\, dP_n$. Then (13) and (14) are simply the limit statements (in distribution and $L^2$, respectively) applied to this linearization.

We omit the proof of the two-sample case, as the changes to the argument (see Theorem 3.3 in Del Barrio and Loubes (2019), for the squared Wasserstein distance) adapt in a straightforward way to the entropic case.

²A thorough discussion of the behavior of unregularized OT for finitely supported measures can be found in Sommerfeld and Munk (2018) and Weed and Bach (2018).

4 Application to entropy estimation

In this section, we give an application of entropic OT to the problem of entropy estimation. First, in Proposition 4 we establish a new relation between entropic OT and the differential entropy of the convolution of two measures. Then, as a corollary of this and the previous sections' results, we prove Theorem 4, which states that entropic OT provides a novel estimator, with performance guarantees, for the differential entropy of the (independent) sum of a subgaussian random variable and a Gaussian random variable.

Throughout this section $\nu$ denotes a translation-invariant measure. Whenever $P$ has a density $p$ with respect to $\nu$, we define its $\nu$-differential entropy as $h(P) := -\int p(x) \log p(x)\, d\nu(x) = -H(P|\nu)$. The following proposition links the differential entropy of a convolution with the entropic cost.

Proposition 4.
Let $\Phi_g$ be the measure with $\nu$-density $\varphi_g(y) = Z_g^{-1} e^{-g(y)}$ for a smooth $g$ ($Z_g$ is the normalizing constant), and define $Q = P * \Phi_g$, with $P \in \mathcal{P}(\mathbb{R}^d)$ arbitrary. The $\nu$-density of $Q$, $q(y)$, satisfies

\[
q(y) = \int \varphi_g(y - x)\, dP(x) = \int Z_g^{-1} e^{-g(y-x)}\, dP(x).
\]

Consider the cost function $c(x, y) := g(x - y)$ (not necessarily quadratic). Then the optimal entropic transport cost and differential entropy are linked through

\[
h(P * \Phi_g) = S(P, P * \Phi_g) + \log Z_g. \tag{17}
\]

Proof. Define a more general entropic transportation cost involving the generic $c$ and measures $\alpha, \beta$:³

\[
S_{\alpha \otimes \beta}(P, Q) := \inf_{\pi \in \Pi(P,Q)} \left[ \int c(x, y)\, d\pi(x, y) + H(\pi \,|\, \alpha \otimes \beta) \right]. \tag{18}
\]

Observe we may re-write (18) as

\[
S_{\alpha \otimes \beta}(P, Q) = \inf_{\pi \in \Pi(P,Q)} \left[ \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, d\pi(x, y) + H(\pi \,|\, P \otimes Q) \right] + H(P \otimes Q \,|\, \alpha \otimes \beta) = S(P, Q) + H(P \otimes Q \,|\, \alpha \otimes \beta). \tag{19}
\]

Additionally, it can be verified that an alternative representation for (18) is the following:

\[
S_{\alpha \otimes \beta}(P, Q) = \inf_{\pi \in \Pi(P,Q)} H(\pi \,|\, \Lambda) - \log Z, \tag{20}
\]

where $Z$ is the number making $\Lambda := Z^{-1} e^{-c}\, \alpha \otimes \beta$ a bona fide probability measure.

Now, take $\alpha = P$, $\beta = \nu$ and $Q = P * \Phi_g$ in the above expressions. For these choices we have $Z = Z_g$.
Indeed, by the translation invariance of \u03bd, we have\n\n(19)\n\n(20)\n\n(cid:18)\n\n(cid:19)\n\n(cid:12)(cid:12)(cid:12)(cid:12)Z\u22121e\u2212c\u03b1 \u2297 \u03b2\n(cid:90) (cid:18)(cid:90)\n(cid:90) (cid:18)(cid:90)\n(cid:90)\n\nZgdP (x) = Zg.\n\n(cid:19)\n\n(cid:19)\n\ne\u2212g(y\u2212x)d\u03bd(y)\n\ndP (x)\n\ne\u2212g(y)d\u03bd(y)\n\ndP (x)\n\n(cid:90)(cid:90)\n\nZ =\n\ne\u2212c(x,y)dP (x)d\u03bd(y) =\n\n=\n\n=\n\n3Notice \u03b1 \u2297 \u03b2 need not be probability measures for the relative entropy H(\u00b7|\u03b1 \u2297 \u03b2) to make sense. In\n\nL\u00e9onard (2014) it is argued it suf\ufb01ces that this product is \u03c3-\ufb01nite.\n\n7\n\n\f(a) If m = n,\n\n(b) The limit\n\n(cid:114) mn\n\nm + n\n\n(cid:19)\n\n(cid:18) 1\u221a\n\n.\n\nn\n\nEP|\u02c6h(Q) \u2212 h(Q)| \u2264 O\n\nsup\nP\n\n(cid:16)\u02c6h(Q) \u2212 E(\u02c6h(Q)\n\n(cid:17) D\u2192 N (0, \u03bb VarQ(log q(Y )))\n\nThen, d\u039b(x, y) = dP (x)\u03c6g(y \u2212 x)d\u03bd(y), and by marginalization we deduce \u039b \u2208 \u03a0(P, P \u2217 \u03a6g).\nTherefore, the right side of (20) equals H(\u039b|\u039b) \u2212 log Zg = \u2212 log Zg. Finally, we combine (19) and\n(20) to obtain\n\n\u2212 log Zg = S(P, P \u2217 \u03a6g) + H (P \u2297 (P \u2217 \u03a6g)|P \u2297 \u03bd) ,\n\nand achieve the \ufb01nal conclusion after noting that\n\nH(P \u2297 (P \u2217 \u03a6g)|P \u2297 \u03bd) = H(P|P ) + H (P \u2217 \u03a6g|\u03bd) = H (P \u2217 \u03a6g|\u03bd) = \u2212h(P \u2217 \u03a6g).\n\nNow we can state the following theorem.\nTheorem 4. Let P be subgaussian, and \u03a6g = N (0, \u0001Id). Denote Q = P \u2217 \u03a6g the distribution\nof the sum of an independent samples from P and \u03a6g, and de\ufb01ne the plug in estimator \u02c6h(Q) =\nS(Pn, Qm) + log Zg where Pn and Qm are independent samples from P and Q. Then,\n\n(21)\n\nholds, where \u03bb = limm,n\u2192\u221e n\n\u03bb VarQ(log q(Y )).\n\nm+n . Moreover,\n\nlimm,n\u2192\u221e mn\n\nm+n Var(\u02c6h(Q)) =\n\nProof. 
(a) is a simple re-statement of Theorem 2 in the light of Proposition 4. (b) is a re-statement of Theorem 3, after noting that in this case the optimal potentials are $(f, g) = (-\log Z_g, -\log q)$.

The rate $1/\sqrt{n}$ in Theorem 4 is also achieved by a different estimator proposed by Goldfeld et al. (2019) (see also Weed, 2018), but that estimator lacks distributional limits.

Figure 1: Top row: $\mathbb{E}\,S(P_n, Q_n)$ as a function of $n \in \{1\mathrm{e}3, 2\mathrm{e}3, 5\mathrm{e}3, 1\mathrm{e}4, 1.5\mathrm{e}4\}$, computed from 16,000 repetitions for each value of $n$. The shading corresponds to one standard deviation of $S(P_n, Q_n) - \mathbb{E}\,S(P_n, Q_n)$, assuming the asymptotics of Theorem 3 are valid. Error bars are one sample standard deviation long on each side. Both $x$ and $y$ axes are in logarithmic scale. Bottom row: histograms of $\sqrt{\frac{nn}{n+n}}\,(S(P_n, Q_n) - \mathbb{E}\,S(P_n, Q_n))$ when $n = 1.5\mathrm{e}4$. Ground truth (numerical integration) is shown with black solid lines.

5 Empirical results

We provide empirical evidence supporting and illustrating our theoretical findings. We focus on the entropy estimation problem because there are closed-form expressions for the potentials (see Theorem 4), and because it allows a comparison with the estimator studied in Goldfeld et al. (2019).⁴

Specifically, consider $X \sim P = \frac{1}{2}\big(\mathcal{N}(\mathbf{1}_d, I_d) + \mathcal{N}(-\mathbf{1}_d, I_d)\big)$, the mixture of the Gaussians centered at $\mathbf{1}_d := (1, \dots, 1)$ and $-\mathbf{1}_d$. We aim to estimate the entropy of the new mixture $Q = P * \Phi_g$. Figure 1, top, shows the convergence of $\mathbb{E}\,S(P_n, Q_n)$ to $S(P, Q)$. Consistent with the bound in Theorem 2 and Corollary 1, $S(P_n, Q_n)$ is a worse estimator for $S(P, Q)$ when $d$ is large or the regularization parameter is small. We also plot the predicted (shading) and actual (bars) fluctuations of $S(P_n, Q_n)$ around its mean.
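The $\hat h_{\mathrm{ind}}(Q)$ pipeline of Theorem 4 is short enough to script end to end. The sketch below is our illustrative code, not the paper's implementation: for brevity $P$ is a single standard Gaussian rather than the two-component mixture above, the Sinkhorn solver is the naive $O(n^2)$-memory implementation, and with $\Phi_g = \mathcal{N}(0, I_1)$ the estimate can be compared against the closed form $h(\mathcal{N}(0, 2)) = \tfrac12 \log(4\pi e)$.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_cost(X, Y, n_iter=300):
    # S(P_n, Q_m): eps = 1, cost 0.5 * ||x - y||^2, uniform sample weights
    n, m = len(X), len(Y)
    C = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        f = -logsumexp(g[None, :] - C - np.log(m), axis=1)
        g = -logsumexp(f[:, None] - C - np.log(n), axis=0)
    return f.mean() + g.mean()

def entropy_estimate(X, Y):
    # hat h(Q) = S(P_n, Q_m) + log Z_g, with Phi_g = N(0, I_d),
    # so that Z_g = (2 * pi)^{d/2}
    d = X.shape[1]
    return sinkhorn_cost(X, Y) + 0.5 * d * np.log(2 * np.pi)

rng = np.random.default_rng(0)
n = m = 300
X = rng.normal(size=(n, 1))                            # P_n from P = N(0, 1)
Y = rng.normal(size=(m, 1)) + rng.normal(size=(m, 1))  # Q_m from Q = N(0, 2)
h_hat = entropy_estimate(X, Y)
h_true = 0.5 * np.log(2 * np.pi * np.e * 2.0)          # h(N(0, 2))
```

By Theorem 4 the error of `h_hat` is of order $1/\sqrt{n}$; the paired variant discussed below would instead build `Y` by adding fresh Gaussian noise to `X`.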
Though the CLT holds only in the asymptotic limit, these experiments reveal that the empirical fluctuations in the finite-$n$ regime are broadly consistent with the predictions of the CLT. Figure 1, bottom, shows that the empirical distribution of the rescaled fluctuations is an excellent match for the predicted normal distribution.

In Figure 2 we compare the performance of the entropic-OT-based estimators from Theorem 4 with $\hat h_{\mathrm{m.g.}}(Q)$, the estimator from Goldfeld et al. (2019), in which $h(P * \Phi_g)$ is estimated as the entropy of the mixture of Gaussians $P_n * \Phi_g$, in turn approximated by Monte Carlo integration. We consider two OT-based estimators: $\hat h_{\mathrm{ind}}(Q)$, where $P_n, Q_n$ are completely independent (i.e., the one used for Figure 1), and $\hat h_{\mathrm{paired}}(Q)$, where the samples $Q_n$ are drawn by adding Gaussian noise to $P_n$. Observe that our sample complexity and CLT results are only available for $\hat h_{\mathrm{ind}}(Q)$.

Results show a clear pattern of dominance, with $\hat h_{\mathrm{paired}}(Q)$ achieving the fastest convergence. The main caveat is the extra memory cost: while $\hat h_{\mathrm{m.g.}}(Q)$ can be computed sequentially with each operation requiring $O(n)$ memory, in the most naive implementation (used here) both $\hat h_{\mathrm{paired}}(Q)$ and $\hat h_{\mathrm{ind}}(Q)$ demand $O(n^2)$ space for storing the matrix $D_{i,j} = e^{-\|x_i - y_j\|^2/2\epsilon}$, to which the Sinkhorn algorithm is applied. This memory requirement might be alleviated with the use of stochastic methods (Genevay et al., 2016; Bercu and Bigot, 2018).

We leave for future work both the implementation of more scalable methods for entropic OT and a detailed theoretical analysis of the different entropic-OT-based estimators (e.g., $\hat h_{\mathrm{paired}}(Q)$ vs. $\hat h_{\mathrm{ind}}(Q)$) that may bring about a better understanding of their observed substantial differences. Additionally, in future work we will explore extensions of our results beyond the subgaussian case, and provide lower bounds as in Goldfeld et al. (2019).

Figure 2: Comparison between $\mathbb{E}\,\hat h_{\mathrm{ind}}(Q)$, $\mathbb{E}\,\hat h_{\mathrm{paired}}(Q)$, and $\mathbb{E}\,\hat h_{\mathrm{m.g.}}(Q)$. Details are the same as in Figure 1.

⁴We don't present comparisons with the recent estimator presented in Berrett et al. (2019). This general-purpose estimator is $\sqrt{n}$-consistent and a CLT is available without a centering constant (such as our $\mathbb{E}\,S(P_n, Q_n)$). However, empirical results show the one in Goldfeld et al. (2019) performs much better.

References

Altschuler, J., Weed, J., and Rigollet, P. (2017). Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1961–1971.

Alvarez-Melis, D. and Jaakkola, T. S. (2018). Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1881–1890.

Bercu, B. and Bigot, J. (2018). Asymptotic distribution and convergence rates of stochastic algorithms for entropic optimal transportation between probability measures. arXiv preprint arXiv:1812.09150.

Berrett, T. B., Samworth, R. J., Yuan, M., et al. (2019). Efficient multivariate entropy estimation via k-nearest neighbour distances. The Annals of Statistics, 47(1):288–318.

Bigot, J., Cazelles, E., and Papadakis, N. (2017). Central limit theorems for Sinkhorn divergence between probability distributions on finite spaces and statistical applications. arXiv preprint arXiv:1711.08947.

Courty, N., Flamary, R., and Tuia, D. (2014). Domain adaptation with regularized optimal transport. In ECML PKDD, pages 274–289.

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2017). Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell., 39(9):1853–1865.

Csiszár, I. (1975).
I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, pages 146–158.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300.

Del Barrio, E. and Loubes, J.-M. (2019). Central limit theorems for empirical transportation cost in general dimension. The Annals of Probability, 47(2):926–951.

Dudley, R. M. (1969). The speed of mean Glivenko-Cantelli convergence. Ann. Math. Statist., 40:40–50.

Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. (2019). Sample complexity of Sinkhorn divergences. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS).

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. (2016). Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pages 3440–3448.

Genevay, A., Peyré, G., and Cuturi, M. (2017). Learning generative models with Sinkhorn divergences. arXiv preprint arXiv:1706.00292.

Giné, E. and Nickl, R. (2016). Mathematical foundations of infinite-dimensional statistical models. Cambridge Series in Statistical and Probabilistic Mathematics, 40. Cambridge University Press, New York.

Goldfeld, Z., Berg, E. v. d., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and Polyanskiy, Y. (2018). Estimating information flow in neural networks. arXiv preprint arXiv:1810.05728.

Goldfeld, Z., Greenewald, K., Polyanskiy, Y., and Weed, J. (2019). Convergence of smoothed empirical measures with applications to entropy estimation. arXiv preprint arXiv:1905.13576.

Grave, E., Joulin, A., and Berthet, Q. (2018). Unsupervised alignment of embeddings with Wasserstein Procrustes. arXiv preprint arXiv:1805.11222.

Klatt, M., Tameling, C., and Munk, A. (2018).
Empirical regularized optimal transport: Statistical theory and applications. arXiv preprint arXiv:1810.09880.

Kolouri, S., Park, S. R., Thorpe, M., Slepcev, D., and Rohde, G. K. (2017). Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Process. Mag., 34(4):43–59.

Léonard, C. (2014). Some properties of path measures. In Séminaire de Probabilités XLVI, pages 207–230. Springer.

Li, P., Wang, Q., and Zhang, L. (2013). A novel earth mover's distance methodology for image matching with Gaussian mixture models. In ICCV, pages 1689–1696.

Montavon, G., Müller, K.-R., and Cuturi, M. (2016). Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 3718–3726.

Peyré, G., Cuturi, M., et al. (2019). Computational optimal transport. Foundations and Trends® in Machine Learning, 11(5-6):355–607.

Rigollet, P. and Weed, J. (2018). Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathématique, 356(11–12).

Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121.

Sandler, R. and Lindenbaum, M. (2011). Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1590–1602.

Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Liu, S., Lin, S., Berube, P., Lee, L., et al. (2019). Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming. Cell. To appear.

Sommerfeld, M. and Munk, A. (2018). Inference for empirical Wasserstein distances on finite spaces. J. R. Stat. Soc. Ser. B. Stat.
Methodol., 80(1):219–238.

Tishby, N. and Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE.

Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press.

Villani, C. (2008). Optimal transport: old and new, volume 338. Springer Science & Business Media.

Weed, J. (2018). Sharper rates for estimating differential entropy under gaussian convolutions. Technical report, Massachusetts Institute of Technology.

Weed, J. and Bach, F. (2018). Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli. To appear.