{"title": "Massively scalable Sinkhorn distances via the Nystr\u00f6m method", "book": "Advances in Neural Information Processing Systems", "page_first": 4427, "page_last": 4437, "abstract": "The Sinkhorn \"distance,\" a variant of the Wasserstein distance with entropic regularization, is an increasingly popular tool in machine learning and statistical inference. However, the time and memory requirements of standard algorithms for computing this distance grow quadratically with the size of the data, rendering them prohibitively expensive on massive data sets. In this work, we show that this challenge is surprisingly easy to circumvent: combining two simple techniques\u2014the Nystr\u00f6m method and Sinkhorn scaling\u2014provably yields an accurate approximation of the Sinkhorn distance with significantly lower time and memory requirements than other approaches. We prove our results via new, explicit analyses of the Nystr\u00f6m method and of the stability properties of Sinkhorn scaling. We validate our claims experimentally by showing that our approach easily computes Sinkhorn distances on data sets hundreds of times larger than can be handled by other techniques.", "full_text": "Massively Scalable Sinkhorn Distances\n\nvia the Nystr\u00f6m Method\n\nJason Altschuler\n\nMIT\n\njasonalt@mit.edu\n\nFrancis Bach\n\nINRIA - ENS - PSL\n\nfrancis.bach@inria.fr\n\nAlessandro Rudi\nINRIA - ENS - PSL\n\nalessandro.rudi@inria.fr\n\nJonathan Niles-Weed\n\nNYU\n\njnw@cims.nyu.edu\n\nAbstract\n\nThe Sinkhorn \u201cdistance,\u201d a variant of the Wasserstein distance with entropic regular-\nization, is an increasingly popular tool in machine learning and statistical inference.\nHowever, the time and memory requirements of standard algorithms for computing\nthis distance grow quadratically with the size of the data, making them prohibitively\nexpensive on massive data sets. 
In this work, we show that this challenge is sur-\nprisingly easy to circumvent: combining two simple techniques\u2014the Nystr\u00f6m\nmethod and Sinkhorn scaling\u2014provably yields an accurate approximation of the\nSinkhorn distance with signi\ufb01cantly lower time and memory requirements than\nother approaches. We prove our results via new, explicit analyses of the Nystr\u00f6m\nmethod and of the stability properties of Sinkhorn scaling. We validate our claims\nexperimentally by showing that our approach easily computes Sinkhorn distances\non data sets hundreds of times larger than can be handled by other techniques.\n\n1\n\nIntroduction\n\nOptimal transport is a fundamental notion in probability theory and geometry [42], which has\nrecently attracted a great deal of interest in the machine learning community as a tool for image\nrecognition [26, 35], domain adaptation [11, 12], and generative modeling [5, 9, 20], among many\nother applications [see, e.g., 25, 31].\nThe growth of this \ufb01eld has been fueled in part by computational advances, many of them stemming\nfrom an in\ufb02uential proposal of Cuturi [13] to modify the de\ufb01nition of optimal transport to include\nan entropic penalty. The resulting quantity, which Cuturi [13] called the Sinkhorn \u201cdistance\u201d1\nafter Sinkhorn [38], is signi\ufb01cantly faster to compute than its unregularized counterpart. Though\noriginally attractive purely for computational reasons, the Sinkhorn distance has since become an\nobject of study in its own right because it appears to possess better statistical properties than the\nunregularized distance both in theory and in practice [21, 29, 31, 34, 36]. Computing this distance as\nquickly as possible has therefore become an area of active study.\nWe brie\ufb02y recall the setting. Let p and q be probability distributions supported on at most n points\nin Rd. 
We denote by M(p, q) the set of all couplings between p and q, and for any P ∈ M(p, q), we denote by H(P) its Shannon entropy. (See Section 2.1 for full definitions.) The Sinkhorn distance¹ between p and q is defined as

W_η(p, q) := min_{P ∈ M(p,q)} Σ_{ij} P_ij ‖x_i − x_j‖₂² − η⁻¹ H(P),   (1)

for a parameter η > 0. We stress that we use the squared Euclidean cost in our formulation of the Sinkhorn distance. This choice of cost—which in the unregularized case corresponds to what is called the 2-Wasserstein distance [42]—is essential to our results, and we do not consider other costs here. The squared Euclidean cost is among the most common in applications [9, 12, 16, 21, 36].

Many algorithms to compute W_η(p, q) are known. Cuturi [13] showed that a simple iterative procedure known as Sinkhorn's algorithm had very fast performance in practice, and later experimental work has shown that greedy and stochastic versions of Sinkhorn's algorithm perform even better in certain settings [3, 20]. These algorithms are notable for their versatility: they provably succeed for any bounded, nonnegative cost. On the other hand, these algorithms are based on matrix manipulations involving the n × n cost matrix C, so their running times and memory requirements inevitably scale with n². In experiments, Cuturi [13] and Genevay et al. [20] showed that these algorithms could reliably be run on problems of size n ≈ 10⁴.

Another line of work has focused on obtaining better running times when the cost matrix has special structure. A preeminent example is due to Solomon et al.

¹We use quotations since it is not technically a distance; see [13, Section 3.2] for details. The quotes are dropped henceforth.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
[40], who focus on the Wasserstein distance\non a compact Riemannian manifold, and show that an approximation to the entropic regularized\nWasserstein distance can be obtained by repeated convolution with the heat kernel on the domain.\nSolomon et al. [40] also establish that for data supported on a grid in Rd, signi\ufb01cant speedups are\npossible by decomposing the cost matrix into \u201cslices\u201d along each dimension [see 31, Remark 4.17].\nWhile this approach allowed Sinkhorn distances to be computed on signi\ufb01cantly larger problems\n(n \u2248 108), it does not extend to non-grid settings. Other proposals include using random sampling\nof auxiliary points to approximate semi-discrete costs [41] or performing a Taylor expansion of the\nkernel matrix in the case of the squared Euclidean cost [4]. These approximations both focus on the\n\u03b7 \u2192 \u221e regime, when the regularization term in (1) is very small, and do not apply to the moderately\nregularized case \u03b7 = O(1) typically used in practice. Moreover, the running time of these algorithms\nscales exponentially in the ambient dimension, which can be very large in applications.\n\n1.1 Our contributions\n\nWe show that a simple algorithm can be used to approximate W\u03b7(p, q) quickly on massive data sets.\nOur algorithm uses only known tools, but we give novel theoretical guarantees that allow us to show\nthat the Nystr\u00f6m method combined with Sinkhorn scaling provably yields a valid approximation\nalgorithm for the Sinkhorn distance at a fraction of the running time of other approaches.\nWe establish two theoretical results of independent interest: (i) New Nystr\u00f6m approximation results\nshowing that instance-adaptive low-rank approximations to Gaussian kernel matrices can be found\nfor data lying on a low-dimensional manifold (Section 3.) 
(ii) New stability results about Sinkhorn\nprojections, establishing that a suf\ufb01ciently good approximation to the cost matrix can be used\n(Section 4.)\n\n1.2 Prior work\n\nComputing the Sinkhorn distance ef\ufb01ciently is a well studied problem in a number of communities.\nThe Sinkhorn distance is so named because, as was pointed out by Cuturi [13], there is an extremely\nsimple iterative algorithm due to Sinkhorn [38] which converges quickly to a solution to (1). This\nalgorithm, which we call Sinkhorn scaling, works very well in practice and can be implemented using\nonly matrix-vector products, which makes it easily parallelizable. Sinkhorn scaling has been analyzed\nmany times [3, 14, 17, 24, 27], and forms the basis for the \ufb01rst algorithms for the unregularized\noptimal transport problem that run in time nearly linear in the size of the cost matrix [3, 14]. Greedy\nand stochastic algorithms related to Sinkhorn scaling with better empirical performance have also\nbeen explored [3, 20]. Another in\ufb02uential technique, due to Solomon et al. [40], exploits the fact\nthat, when the distributions are supported on a grid, Sinkhorn scaling performs extremely quickly by\ndecomposing the cost matrix along lower-dimensional slices.\nOther algorithms have sought to solve (1) by bypassing Sinkhorn scaling entirely. Blanchet et al. [8]\nproposed to solve (1) directly using second-order methods based on fast Laplacian solvers [2, 10].\n\n2\n\n\fBlanchet et al. [8] and Quanrud [32] have noted a connection to packing linear programs, which can\nalso be exploited to yield near-linear time algorithms for unregularized transport distances.\nOur main algorithm relies on constructing a low-rank approximation of a Gaussian kernel matrix\nfrom a small subset of its columns and rows. 
Computing such approximations is a problem with an extensive literature in machine learning, where it has been studied under many different names, e.g., Nyström method [44], sparse greedy approximations [39], incomplete Cholesky decomposition [15], Gram-Schmidt orthonormalization [37], or CUR matrix decompositions [28]. The approximation properties of these algorithms are now well understood [1, 6, 22, 28]; however, in this work, we require significantly more accurate bounds than are available from existing results, as well as adaptive bounds for low-dimensional data. To establish these guarantees, we follow an approach based on approximation theory [see, e.g., 7, 33, 43], which consists of analyzing interpolation operators for the reproducing kernel Hilbert space corresponding to the Gaussian kernel.

Finally, this paper adds to recent work proposing the use of low-rank approximation for Sinkhorn scaling [4, 41]. We improve upon those papers in several ways. First, although we also exploit the idea of a low-rank approximation to the kernel matrix, we do so in a more sophisticated way that allows for automatic adaptivity to data with low-dimensional structure. These new approximation results are the key to our adaptive algorithm, and this yields a significant improvement in practice. Second, the analyses of Altschuler et al. [4] and Tenetov et al. [41] only yield an approximation to W_η(p, q) when η → ∞. In the moderately regularized case when η = O(1), which is typically used in practice, neither the work of Altschuler et al. [4] nor of Tenetov et al. [41] yields a rigorous error guarantee.

2 Main Result

2.1 Preliminaries and notation

Problem setup. Throughout, p and q are two probability distributions supported on a set X := {x₁, . . . , xₙ} of points in R^d, with ‖x_i‖₂ ≤ R for all i ∈ [n] := {1, . . . , n}. We define the cost matrix C ∈ R^{n×n} by C_ij = ‖x_i − x_j‖₂². We identify p and q with vectors in the simplex ∆_n := {v ∈ R^n_{≥0} : Σ_{i=1}^n v_i = 1} whose entries denote the weight each distribution gives to the points of X. We denote by M(p, q) the set of couplings between p and q, identified with the set of P ∈ R^{n×n}_{≥0} satisfying P1 = p and Pᵀ1 = q, where 1 denotes the all-ones vector in R^n. The Shannon entropy of a non-negative matrix P ∈ R^{n×n}_{≥0} is denoted H(P) := Σ_{ij} P_ij log(1/P_ij), where we adopt the standard convention that 0 log(1/0) = 0.

Our goal is to approximate the Sinkhorn distance (1) to some additive accuracy ε > 0. By strict convexity, this optimization problem has a unique minimizer, which we denote henceforth by P^η. For shorthand, in the sequel we write

V_M(P) := ⟨M, P⟩ − η⁻¹ H(P),

for a matrix M ∈ R^{n×n}. In particular, we have W_η(p, q) = min_{P ∈ M(p,q)} V_C(P). For the purpose of simplifying some bounds, we assume throughout that n ≥ 2, η ∈ [1, n], R ≥ 1, and ε ≤ 1.

Sinkhorn scaling. Our approach is based on Sinkhorn scaling, an algorithm due to Sinkhorn [38] and popularized for optimal transport by Cuturi [13]. We recall the following fundamental definition.

Definition 1. Given p, q ∈ ∆_n and K ∈ R^{n×n} with positive entries, the Sinkhorn projection Π^S_{M(p,q)}(K) of K onto M(p, q) is the unique matrix in M(p, q) of the form D₁KD₂ for positive diagonal matrices D₁ and D₂.

Since p and q remain fixed throughout, we abbreviate Π^S_{M(p,q)} by Π^S except when we want to make the feasible set M(p, q) explicit.

Proposition 1 ([45]). Let K have strictly positive entries, and let log K be the matrix defined by (log K)_ij := log(K_ij).
Then

Π^S_{M(p,q)}(K) = argmin_{P ∈ M(p,q)} ⟨−log K, P⟩ − H(P).

Note that the strict convexity of −H(P) and the compactness of M(p, q) imply that the minimizer exists and is unique.

This yields the following simple but key connection between Sinkhorn distances and Sinkhorn scaling.

Corollary 1. P^η = Π^S_{M(p,q)}(K), where K is defined by K_ij = e^{−ηC_ij}.

Notation. We define the probability simplices ∆_n := {p ∈ R^n_{≥0} : pᵀ1 = 1} and ∆_{n×n} := {P ∈ R^{n×n}_{≥0} : 1ᵀP1 = 1}. Elements of ∆_{n×n} will be called joint distributions. The Kullback-Leibler divergence between two joint distributions P and Q is KL(P‖Q) := Σ_{ij} P_ij log(P_ij/Q_ij).

Throughout the paper, all matrix exponentials and logarithms will be taken entrywise, i.e., (e^A)_ij := e^{A_ij} and (log A)_ij := log A_ij for A ∈ R^{n×n}. Given a matrix A, we denote by ‖A‖_op its operator norm (i.e., largest singular value), by ‖A‖_* its nuclear norm (i.e., the sum of its singular values), by ‖A‖₁ its entrywise ℓ₁ norm (i.e., ‖A‖₁ := Σ_{ij} |A_ij|), and by ‖A‖_∞ its entrywise ℓ_∞ norm (i.e., ‖A‖_∞ := max_{ij} |A_ij|). We abbreviate "positive semidefinite" by "PSD."

The notation f = O(g) means that f ≤ Cg for some universal constant C, and g = Ω(f) means f = O(g). The notation Õ(·) omits polylogarithmic factors depending on R, η, n, and ε.

2.2 Main result and proposed algorithm

Pseudocode for our proposed algorithm is given in Algorithm 1.
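Definition 1 and Corollary 1 admit an immediate dense implementation: alternately rescale the rows and columns of K until both marginal constraints hold. The sketch below is only an illustration of the projection Π^S for small n (each iteration touches all n² entries); the SINKHORN subroutine used by NYS-SINK instead accesses the kernel only through matrix-vector products with its low-rank factors.

```python
import numpy as np

def sinkhorn_projection(K, p, q, iters=500):
    # Alternately rescale rows and columns so that D1 K D2 has marginals p, q
    # (Definition 1); the limit is the Sinkhorn projection of K onto M(p, q).
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(iters):
        u = p / (K @ v)
        v = q / (K.T @ u)
    return u[:, None] * K * v[None, :]   # D1 K D2 with D1 = diag(u), D2 = diag(v)

# Corollary 1: with K = exp(-eta * C), the projection is the optimal plan P^eta.
rng = np.random.default_rng(0)
X = rng.random((30, 2))                       # hypothetical points in the unit square
C = ((X[:, None] - X[None, :]) ** 2).sum(-1)  # squared Euclidean cost matrix
p = np.full(30, 1 / 30)
q = np.full(30, 1 / 30)
P = sinkhorn_projection(np.exp(-1.0 * C), p, q)   # eta = 1
```

After convergence, P lies in M(p, q) up to numerical precision, and by Proposition 1 it minimizes ⟨−log K, ·⟩ − H(·) over that set.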
NYS-SINK (pronounced "nice sink") computes a low-rank Nyström approximation of the kernel matrix via a column sampling procedure. For reasons of space, full pseudocode and proofs of all claims are deferred to the supplement.

As noted in Section 1, the Nyström method constructs a low-rank approximation to a Gaussian kernel matrix K = e^{−ηC} based on a small number of its columns. In order to design an efficient algorithm, we aim to construct such an approximation with the smallest possible rank. The key quantity for understanding the error of this algorithm is the so-called effective dimension (also sometimes called the "degrees of freedom") of the kernel matrix K [18, 30, 46].

Definition 2. Let λ_j(K) denote the jth largest eigenvalue of K (with multiplicity). Then the effective dimension of K at level τ > 0 is

d_eff(τ) := Σ_{j=1}^n λ_j(K) / (λ_j(K) + τn).   (2)

The effective dimension d_eff(τ) indicates how large the rank of an approximation K̃ to K must be in order to obtain the guarantee ‖K̃ − K‖_op ≤ τn. For our application, we have K = e^{−ηC}, and we will show that it suffices to obtain an approximate kernel K̃ satisfying ‖K̃ − K‖_op ≤ (ε′/2) e^{−4ηR²}, where ε′ = Õ(εR⁻²). We are therefore motivated to define the following quantity, which informally captures the smallest possible rank of an approximation of this quality.

Definition 3. Given X = {x₁, . . . , xₙ} ⊆ R^d with ‖x_i‖₂ ≤ R for all i ∈ [n], η > 0, and ε′ ∈ (0, 1), the approximation rank is

r*(X, η, ε′) := d_eff((ε′/2n) e^{−4ηR²}),

where d_eff(·) is the effective dimension for the kernel matrix K := e^{−ηC}.

As we show below, we adaptively construct an approximate kernel K̃ whose rank is at most a logarithmic factor bigger than r*(X, η, ε′) with high probability. We also give concrete bounds on r*(X, η, ε′) below.

Our proposed algorithm makes use of several subroutines. The ADAPTIVENYSTRÖM procedure in Algorithm 1 combines an algorithm of Musco & Musco [30] with a doubling trick that enables automatic adaptivity. It outputs the approximate kernel K̃ and its rank r.

The SINKHORN procedure in Algorithm 1 is the Sinkhorn scaling algorithm for projecting K̃ onto M(p, q). We use a variant of the standard algorithm, which returns both the scaling matrices and an approximation of the cost of an optimal solution. The ROUND procedure in Algorithm 1 is Algorithm 2 of Altschuler et al. [3].

We emphasize that neither D₁K̃D₂ nor P̂ are ever represented explicitly, since this would take Ω(n²) time. Instead, we maintain these matrices in low-rank factorized forms.
This enables Algorithm 1 to be implemented efficiently in o(n²) time, since the procedures SINKHORN and ROUND can both be implemented such that they depend on K̃ only through matrix-vector multiplications with K̃. Moreover, we also emphasize that all steps of Algorithm 1 are easily parallelizable since they can be re-written in terms of matrix-vector multiplications.

We note also that although the present paper focuses specifically on the squared Euclidean cost c(x_i, x_j) = ‖x_i − x_j‖₂² (corresponding to the 2-Wasserstein case of optimal transport pervasively used in applications; see intro), our algorithm NYS-SINK readily extends to other cases of optimal transport. Indeed, since the Nyström method works not only for Gaussian kernel matrices K_ij = e^{−η‖x_i−x_j‖₂²}, but in fact more generally for any PSD kernel matrix, our algorithm can be used on any optimal transport instance for which the corresponding kernel matrix K_ij = e^{−ηc(x_i, x_j)} is PSD.

Algorithm 1 NYS-SINK
1: Input: X = {x₁, . . . , xₙ} ⊆ R^d, p, q ∈ ∆_n, ε, η > 0
2: Output: P̂ ∈ M(p, q), Ŵ ∈ R, r ∈ N
3: ε′ ← min(1, εη / (50(4R²η + log(n/(ηε)))))
4: (K̃, r) ← ADAPTIVENYSTRÖM(X, η, (ε′/2) e^{−4ηR²})   {Compute low-rank approximation}
5: (D₁, D₂, Ŵ) ← SINKHORN(K̃, p, q, ε′)   {Approximate Sinkhorn projection and cost}
6: P̂ ← ROUND(D₁K̃D₂, p, q)   {Round to feasible set}
7: Return P̂, Ŵ

Our main result is the following.

Theorem 1. Let ε, δ ∈ (0, 1). Algorithm 1 runs in Õ(nr(r + ηR⁴/ε)) time, uses O(n(r + d)) space, and returns a feasible matrix P̂ ∈ M(p, q) in factored form and scalars Ŵ ∈ R and r ∈ N, where

|V_C(P̂) − W_η(p, q)| ≤ ε,   (3a)
KL(P̂ ‖ P^η) ≤ ηε,   (3b)
|Ŵ − W_η(p, q)| ≤ ε,   (3c)

and, with probability 1 − δ,

r ≤ c · r*(X, η, ε′) log(n/δ),   (3d)

for a universal constant c and where ε′ = Ω̃(εR⁻²).

We note that, while our algorithm is randomized, we obtain a deterministic guarantee that P̂ is a good solution. We also note that runtime dependence on the radius R—which governs the scale of the problem—is inevitable since we seek an additive guarantee.

We show in Section 3 that r*—which controls the running time of the algorithm with high probability by (3d)—adapts to the intrinsic dimension of the data. This adaptivity is crucial in applications, where data can have much lower dimension than the ambient space. We informally summarize this behavior in the following theorem.

Theorem 2 (Informal). There exists a universal constant c > 0 such that, for any n points in a ball of radius R in R^d, r*(X, η, ε′) ≤ (c(ηR² + log(n/(ε′η))))^d. Moreover, for any k-dimensional manifold Ω satisfying certain technical conditions and η > 0, there exists a constant c_{Ω,η} such that for any n points lying on Ω, r*(X, η, ε′) ≤ c_{Ω,η} (log(n/ε′))^{5k/2}.

The formal versions of these bounds appear in Section 3.
The second bound is significantly better than the first when k ≪ d, and clearly shows the benefits of an adaptive procedure.

Combining Theorems 1 and 2 yields the following time and space complexity for our algorithm.

Corollary 2 (Informal). If X consists of n points lying in a ball of radius R in R^d, then with high probability Algorithm 1 requires Õ(nε⁻¹(cηR² + c log(n/ε))^{2d+1}) time and Õ(n(cηR² + c log(n/ε))^d) space. Moreover, if X lies on a k-dimensional manifold Ω, then with high probability Algorithm 1 requires Õ(nε⁻¹ c_{Ω,η} (log(n/ε))^{5k}) time and Õ(n c_{Ω,η} (log(n/ε))^{5k/2}) space.

3 Kernel Approximation via the Nyström Method

Given points X = {x₁, . . . , xₙ} with ‖x_i‖₂ ≤ R for all i ∈ [n], let K ∈ R^{n×n} denote the matrix with entries K_ij := k_η(x_i, x_j), where k_η(x, x′) := e^{−η‖x−x′‖²}. Note that k_η(x, x′) is the Gaussian kernel e^{−‖x−x′‖²/(2σ²)} between points x and x′ with bandwidth parameter σ² = 1/(2η). For r ∈ N, we consider an approximation of the matrix K that is of the form K̃ = VA⁻¹Vᵀ, where V ∈ R^{n×r} and A ∈ R^{r×r}. Note that the matrix K̃ is never computed explicitly. Indeed, our proposed Algorithm 1 only depends on K̃ through computing matrix-vector products K̃v, where v ∈ R^n, and these can be computed efficiently as K̃v = V(L⁻ᵀ(L⁻¹(Vᵀv))), where L ∈ R^{r×r} is the lower triangular matrix satisfying LLᵀ = A. Once a Cholesky decomposition of A has been obtained—at computational cost O(r³)—matrix-vector products can therefore be computed in time O(nr).
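The factored multiply just described can be sketched directly. For brevity this sketch uses the generic `np.linalg.solve` in place of dedicated triangular solves (which would cost O(r²) each once L is precomputed), and stands in an arbitrary PSD matrix for A; the structure of the computation is the point.

```python
import numpy as np

def factored_matvec(V, L, v):
    # Compute (V A^{-1} V^T) v as V (L^{-T} (L^{-1} (V^T v))), where A = L L^T.
    # Only r-dimensional intermediates are formed; the n x n matrix never is.
    w = V.T @ v              # r-vector, O(nr)
    w = np.linalg.solve(L, w)      # L^{-1} w  (triangular solve in practice)
    w = np.linalg.solve(L.T, w)    # L^{-T} w  (triangular solve in practice)
    return V @ w             # O(nr)

rng = np.random.default_rng(0)
n, r = 400, 20
V = rng.normal(size=(n, r))
A = V[:r] @ V[:r].T + np.eye(r)    # any r x r positive definite matrix stands in for A
L = np.linalg.cholesky(A)          # one-time O(r^3) cost
v = rng.normal(size=n)
direct = V @ np.linalg.solve(A, V.T @ v)   # reference computation
```

The factored product agrees with the direct one while never materializing V A⁻¹ Vᵀ, which is what keeps the per-iteration cost of SINKHORN and ROUND linear in n.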
In the supplement, we give pseudocode for the ADAPTIVENYSTRÖM subroutine, based on a simple doubling trick. It enjoys the following guarantee:

Lemma 1. Let K̃ denote the (random) kernel output by ADAPTIVENYSTRÖM(X, η, τ), and let r := rank(K̃). Then ‖K − K̃‖_∞ ≤ τ, the algorithm uses O(nr) space and terminates in O(nr²) time, and there exists a universal constant c such that, simultaneously for every δ > 0,

P( r ≤ c · d_eff(τ/n) log(n/δ) ) ≥ 1 − δ.

3.1 General results: data points lie in a ball

In this section we assume no structure on X apart from the fact that X ⊆ B^d_R, where B^d_R is a ball of radius R in R^d centered around the origin, for some R > 0 and d ∈ N. First we characterize the eigenvalues of K in terms of η, d, R, and then we use this to bound d_eff.

Theorem 3. Let X := {x₁, . . . , xₙ} ⊆ B^d_R, and let K ∈ R^{n×n} be the matrix with entries K_ij := e^{−η‖x_i−x_j‖²}. Then:
1. For t ≥ (2e)^d, λ_{t+1}(K) ≤ n e^{−(d/(2e)) t^{1/d} log(d t^{1/d} / (4e²ηR²))}.
2. For τ ∈ (0, 1],

d_eff(τ) ≤ 3(6 + (41/d) ηR² + (3/d) log(1/τ))^d.

Corollary 3. Let ε′ ∈ (0, 1) and η > 0. If X consists of n points lying in a ball of radius R around the origin in R^d, then

r*(X, η, ε′) ≤ 3(6 + (53/d) ηR² + (3/d) log(2n/ε′))^d.

3.2 Adaptivity: data points lie on a low-dimensional manifold

In this section, we show that the quality of the Nyström approximation adapts to the intrinsic dimension of the data.
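The effective dimension of Definition 2 is directly computable from the kernel spectrum. The following sketch, on hypothetical uniformly sampled 2-D data with η = 1, illustrates the behavior Theorem 3 quantifies: d_eff grows only slowly as the level τ decreases, and stays far below the number of points n for a smooth Gaussian kernel.

```python
import numpy as np

def effective_dimension(K, tau):
    # d_eff(tau) = sum_j lambda_j / (lambda_j + tau * n)   (Definition 2)
    lam = np.linalg.eigvalsh(K)
    lam = np.clip(lam, 0.0, None)   # guard against tiny negative round-off eigenvalues
    return float((lam / (lam + tau * len(K))).sum())

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))               # hypothetical data in a 2-D ball
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-1.0 * sq)                               # eta = 1
d_coarse = effective_dimension(K, 1e-2)
d_fine = effective_dimension(K, 1e-4)               # smaller tau => larger d_eff
```

Since d_eff(τ) controls the rank needed for a τn-accurate Nyström approximation, this slow growth is what makes low-rank kernel approximation viable even at the small error levels required by Algorithm 1.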
Let Ω ⊂ R^d be a smooth compact manifold without boundary of dimension k, for k < d, and let (Ψ_j, U_j)_{j∈[T]}, with T ∈ N, be an atlas for Ω. We assume the following quantitative control on the smoothness of the atlas.

Assumption 1. There exists Q > 0 such that

sup_{u ∈ B^k_{r_j}} ‖D^α Ψ_j⁻¹(u)‖ ≤ Q^{|α|},   α ∈ N^k, j ∈ [T],

where |α| = Σ_{j=1}^k α_j and D^α = ∂^{|α|} / (∂u₁^{α₁} · · · ∂u_k^{α_k}), for α ∈ N^k.

Theorem 4. Let Ω ⊂ B^d_R ⊂ R^d be a smooth compact manifold without boundary satisfying Assumption 1. Let X := {x₁, . . . , xₙ} ⊆ Ω, and let K ∈ R^{n×n} be the matrix with entries K_ij := e^{−η‖x_i−x_j‖²}. Then:
1. There exists a constant c not depending on X or n, such that for t ≥ 0, λ_{t+1}(K) ≤ n e^{−c t^{2/(5k)}}.
2. There exist c₁, c₂ not depending on X, n, or τ, such that for τ ∈ (0, 1],

d_eff(τ) ≤ (c₁ log(1/τ))^{5k/2} + c₂.

The result above is new, to our knowledge, and extends interpolation results on manifolds [19, 23, 43] from polynomial to exponential decay, generalizing a technique of Rieger & Zwicknagl [33] to a subset of real analytic manifolds. The crucial point is that now the eigenvalue decay and the effective dimension depend on the dimension k of the manifold and not on the ambient dimension d ≫ k.

Corollary 4. Let ε′ ∈ (0, 1), η > 0, and let Ω ⊂ R^d be a manifold of dimension k ≤ d satisfying Assumption 1. There exists c_{Ω,η} > 0 not depending on X or n such that

r*(X, η, ε′) ≤ c_{Ω,η} (log(n/ε′))^{5k/2}.

4 Sinkhorn Scaling an Approximate Kernel Matrix

The main result of this section, presented next, gives both a runtime bound and an error bound on the approximate Sinkhorn scaling performed in Algorithm 1. The error bound shows that the objective function V_C(·) in (1) is stable with respect to both (i) Sinkhorn projecting an approximate kernel matrix K̃ instead of the true kernel matrix K, and (ii) only performing an approximate Sinkhorn projection. The results of this section apply to any bounded cost matrix C ∈ R^{n×n}, with ε′ := min(1, εη / (50(‖C‖_∞ η + log(n/(ηε))))).

Theorem 5. If K = e^{−ηC} and if K̃ ∈ R^{n×n}_{>0} satisfies ‖log K − log K̃‖_∞ ≤ ε′, then the SINKHORN subroutine in Algorithm 1 outputs D₁, D₂, and Ŵ such that P̃ := D₁K̃D₂ satisfies ‖P̃1 − p‖₁ + ‖P̃ᵀ1 − q‖₁ ≤ ε′, |V_C(P^η) − V_C(P̃)| ≤ ε/2, and |Ŵ − V_C(P̃)| ≤ ε/2. Moreover, if matrix-vector products can be computed with K̃ and K̃ᵀ in time T_MULT, then this takes time Õ((n + T_MULT) η ‖C‖_∞ ε′⁻¹).

The running time bound in Theorem 5 for the time required to produce D₁ and D₂ follows directly from prior work, which has shown that Sinkhorn scaling can produce an approximation to the Sinkhorn projection of a positive matrix in time nearly independent of the dimension n [3, 14]. The error bounds in Theorem 5 are based on Propositions 2 and 3.

Proposition 2.
For any p, q ∈ ∆_n and any K, K̃ ∈ R^{n×n}_{+},

‖Π^S(K) − Π^S(K̃)‖₁ ≤ ‖log K − log K̃‖_∞.

Proposition 3. Given K̃ ∈ R^{n×n}_{>0}, let C̃ ∈ R^{n×n} satisfy C̃_ij := −η⁻¹ log K̃_ij. Let D₁ and D₂ be positive diagonal matrices such that P̃ := D₁K̃D₂ ∈ ∆_{n×n}, with δ := ‖p − P̃1‖₁ + ‖q − P̃ᵀ1‖₁. If δ ≤ 1, then

|V_{C̃}(Π^S(K̃)) − V_{C̃}(P̃)| ≤ δ‖C̃‖_∞ + η⁻¹ δ log(2n/δ).

5 Experimental Results

In this section we empirically validate our theoretical results. Details about the setup for each experiment appear in the supplement.

We first compare to the standard Sinkhorn algorithm. Fig. 1 plots the time-accuracy tradeoff for NYS-SINK, compared to the standard SINKHORN algorithm. Fig. 1 shows that NYS-SINK is consistently orders of magnitude faster to obtain the same accuracy.

We then investigate NYS-SINK's dependence on the intrinsic dimension and ambient dimension of the input. This is done by running NYS-SINK with a fixed approximation rank on distributions supported on 1-dimensional curves embedded in higher dimensions. Fig. 2 empirically validates the result in

Figure 1: Time-accuracy tradeoff for NYS-SINK and SINKHORN, for a range of regularization parameters η (each corresponding to a different Sinkhorn distance W_η) and approximation ranks r. Each experiment has been repeated 50 times; the variance is indicated by the shaded area around the curves.
Note that curves in the plot start at different points corresponding to the time required for initialization.

Figure 2: Accuracy of NYS-SINK as a function of running time, for different ambient dimensions.

Figure 3: Running time vs. input size n for NYS-SINK and SINKHORN. Top uses random point cloud data as in Fig. 1; bottom uses embedded curve data as in Fig. 2.

Corollary 4, namely that the required approximation rank – and consequently the computational complexity of NYS-SINK – is independent of the ambient dimension.

Next, we demonstrate NYS-SINK's dependence on the size n of the dataset. As Fig. 3 indicates, the running time of NYS-SINK is empirically well-approximated by a line with slope 1 in the log-log plane – representing a complexity of Θ(n) – whereas the running time of SINKHORN scales as Θ(n²).

Table 1: Performance of our algorithm on benchmark dataset.

Exp. 1 (n = 3 × 10⁵, d = 3, η = 15)              W2             time (s)
Nys-Sink (r = 2000, T = 20)                      0.087 ± 0.008  0.4 ± 0.1
Dual-Sink Multiscale + Anneal. (r = 0.95)        0.090          3.4
Dual-Sink + Anneal. (r = 0.95)                   0.087          35.4

Exp. 2 (n = 3.8 × 10⁶, d = 3, η = 15)            W2             time (s)
Nys-Sink (r = 2000, T = 20)                      0.11 ± 0.01    6.3 ± 0.8
Dual-Sink Multiscale + Anneal. (r = 0.95)        0.11           103.6
Dual-Sink + Anneal. (r = 0.95)                   0.10           1168

Moreover, SINKHORN saturates the RAM already for n ≈ 10⁴, whereas NYS-SINK can scale to n ≈ 10⁶ on the same machine.

Finally, we evaluate the performance of our algorithm on a benchmark dataset used in computer graphics.
We measured the Wasserstein distance between two 3D point clouds from the Stanford 3D Scanning Repository.² We ran two experiments, with $n = 3 \times 10^5$ and $n = 3.8 \times 10^6$ points, respectively.
We ran $T = 20$ iterations of our algorithm (NYS-SINK) with approximation rank $r = 2000$ on a GPU, and compared against two optimized implementations from the GeomLoss library.³ The results appear in Table 1; each NYS-SINK experiment was repeated 50 times. For moderate regularization $\eta$, our method is comparable to the other approaches in precision, with a computational time that is orders of magnitude smaller. We chose the parameters $r$ and $T$ for NYS-SINK by hand, to balance precision against time complexity.
We note that in these experiments, instead of using our doubling-trick algorithm to choose the rank adaptively, we simply ran experiments with a small fixed choice of $r$. As our experiments demonstrate, NYS-SINK achieves good empirical performance even when the rank $r$ is smaller than our theoretical analysis requires. Investigating this empirical success further is an interesting topic for future study.

6 Acknowledgments

We thank the reviewers for their helpful comments. We also thank Piotr Indyk, Pablo Parrilo, and Philippe Rigollet for helpful discussions. JA was supported in part by NSF Graduate Research Fellowship 1122374. FB and AR were supported in part by the European Research Council (grant SEQUOIA 724063). JNW was supported in part by the Josephine de Kármán Fellowship.

References

[1] Alaoui, A. and Mahoney, M. W. Fast randomized kernel ridge regression with statistical guarantees. In Adv. NIPS, pp. 775–783, 2015.

[2] Allen-Zhu, Z., Li, Y., Oliveira, R., and Wigderson, A. Much faster algorithms for matrix scaling. In FOCS, pp. 890–901, 2017.

[3] Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Adv.
NIPS, pp. 1961–1971, 2017.

[4] Altschuler, J., Bach, F., Rudi, A., and Weed, J. Approximating the quadratic transportation metric in near-linear time. arXiv preprint arXiv:1810.10046, 2018.

[5] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[6] Bach, F. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory, pp. 185–209, 2013.

[7] Belkin, M. Approximation beats concentration? An approximation view on inference with smooth radial kernels. arXiv preprint arXiv:1801.03437, 2018.

[8] Blanchet, J., Jambulapati, A., Kent, C., and Sidford, A. Towards optimal running times for optimal transport. arXiv preprint arXiv:1810.07717, 2018.

[9] Bousquet, O., Gelly, S., Tolstikhin, I., Simon-Gabriel, C.-J., and Schoelkopf, B. From optimal transport to generative modeling: the VEGAN cookbook. arXiv preprint arXiv:1705.07642, 2017.

[10] Cohen, M. B., Madry, A., Tsipras, D., and Vladu, A. Matrix scaling and balancing via box constrained Newton's method and interior point methods. In FOCS, pp. 902–913, 2017.

²http://graphics.stanford.edu/data/3Dscanrep/
³http://www.kernel-operations.io/geomloss/

[11] Courty, N., Flamary, R., and Tuia, D. Domain adaptation with regularized optimal transport. In ECML PKDD, pp. 274–289, 2014.

[12] Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell., 39(9):1853–1865, 2017.

[13] Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Adv. NIPS, pp. 2292–2300, 2013.

[14] Dvurechensky, P., Gasnikov, A., and Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by Sinkhorn's algorithm. arXiv preprint arXiv:1802.04367, 2018.

[15] Fine, S. and Scheinberg, K.
Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.

[16] Forrow, A., Hütter, J.-C., Nitzan, M., Rigollet, P., Schiebinger, G., and Weed, J. Statistical optimal transport via factored couplings. arXiv preprint arXiv:1806.07348, 2018.

[17] Franklin, J. and Lorenz, J. On the scaling of multidimensional matrices. Linear Algebra Appl., 114/115:717–735, 1989.

[18] Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, 2001.

[19] Fuselier, E. and Wright, G. B. Scattered data interpolation on embedded submanifolds with restricted positive definite kernels: Sobolev error estimates. SIAM Journal on Numerical Analysis, 50(3):1753–1776, 2012.

[20] Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In Adv. NIPS, pp. 3440–3448, 2016.

[21] Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In AISTATS, pp. 1608–1617, 2018.

[22] Gittens, A. The spectral norm error of the naive Nyström extension. arXiv preprint arXiv:1110.5305, 2011.

[23] Hangelbroek, T., Narcowich, F. J., and Ward, J. D. Kernel approximation on manifolds I: bounding the Lebesgue constant. SIAM Journal on Mathematical Analysis, 42(4):1732–1760, 2010.

[24] Kalantari, B., Lari, I., Ricca, F., and Simeone, B. On the complexity of general matrix scaling and entropy minimization via the RAS algorithm. Math. Program., 112(2, Ser. A):371–401, 2008.

[25] Kolouri, S., Park, S. R., Thorpe, M., Slepcev, D., and Rohde, G. K. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Process. Mag., 34(4):43–59, 2017.

[26] Li, P., Wang, Q., and Zhang, L.
A novel earth mover's distance methodology for image matching with Gaussian mixture models. In ICCV, pp. 1689–1696, 2013.

[27] Linial, N., Samorodnitsky, A., and Wigderson, A. A deterministic strongly polynomial algorithm for matrix scaling and approximate permanents. In STOC, pp. 644–652. ACM, 1998.

[28] Mahoney, M. W. and Drineas, P. CUR matrix decompositions for improved data analysis. PNAS, 106(3):697–702, 2009.

[29] Montavon, G., Müller, K., and Cuturi, M. Wasserstein training of restricted Boltzmann machines. In Adv. NIPS, pp. 3711–3719, 2016.

[30] Musco, C. and Musco, C. Recursive sampling for the Nyström method. In Adv. NIPS, pp. 3833–3845, 2017.

[31] Peyré, G. and Cuturi, M. Computational optimal transport. Technical report, 2017.

[32] Quanrud, K. Approximating optimal transport with linear programs. In SOSA, 2019. To appear.

[33] Rieger, C. and Zwicknagl, B. Sampling inequalities for infinitely smooth functions, with applications to interpolation and machine learning. Advances in Computational Mathematics, 32(1):103, 2010.

[34] Rigollet, P. and Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathématique, 2018. To appear.

[35] Rubner, Y., Tomasi, C., and Guibas, L. J. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[36] Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Liu, S., Lin, S., Berube, P., Lee, L., et al. Reconstruction of developmental landscapes by optimal-transport analysis of single-cell gene expression sheds light on cellular reprogramming. Cell, 2019. To appear.

[37] Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[38] Sinkhorn, R. Diagonal equivalence to matrices with prescribed row and column sums.
The American Mathematical Monthly, 74(4):402–405, 1967.

[39] Smola, A. J. and Schölkopf, B. Sparse greedy matrix approximation for machine learning. In Proc. ICML, 2000.

[40] Solomon, J., de Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains. ACM Trans. Graph., 34(4):66:1–66:11, July 2015.

[41] Tenetov, E., Wolansky, G., and Kimmel, R. Fast entropic regularized optimal transport using semidiscrete cost approximation. SIAM J. Sci. Comput., 40(5):A3400–A3422, 2018.

[42] Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[43] Wendland, H. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

[44] Williams, C. and Seeger, M. Using the Nyström method to speed up kernel machines. In Adv. NIPS, 2001.

[45] Wilson, A. G. The use of entropy maximising models, in the theory of trip distribution, mode split and route split. Journal of Transport Economics and Policy, pp. 108–126, 1969.

[46] Zhang, T. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.