{"title": "Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes", "book": "Advances in Neural Information Processing Systems", "page_first": 5660, "page_last": 5670, "abstract": "We introduce a novel framework for statistical analysis of populations of non-degenerate Gaussian processes (GPs), which are natural representations of uncertain curves. This allows inherent variation or uncertainty in function-valued data to be properly incorporated in the population analysis. Using the 2-Wasserstein metric we geometrize the space of GPs with L2 mean and covariance functions over compact index spaces. We prove uniqueness of the barycenter of a population of GPs, as well as convergence of the metric and the barycenter of their finite-dimensional counterparts. This justifies practical computations. Finally, we demonstrate our framework through experimental validation on GP datasets representing brain connectivity and climate development. A Matlab library for relevant computations will be published at https://sites.google.com/view/antonmallasto/software.", "full_text": "Learning from uncertain curves:\n\nThe 2-Wasserstein metric for Gaussian processes\n\nAnton Mallasto\n\nDepartment of Computer Science\n\nUniversity of Copenhagen\n\nmallasto@di.ku.dk\n\nAasa Feragen\n\nDepartment of Computer Science\n\nUniversity of Copenhagen\n\naasa@di.ku.dk\n\nAbstract\n\nWe introduce a novel framework for statistical analysis of populations of non-\ndegenerate Gaussian processes (GPs), which are natural representations of uncertain\ncurves. This allows inherent variation or uncertainty in function-valued data to be\nproperly incorporated in the population analysis. Using the 2-Wasserstein metric we\ngeometrize the space of GPs with L2 mean and covariance functions over compact\nindex spaces. We prove uniqueness of the barycenter of a population of GPs, as well\nas convergence of the metric and the barycenter of their \ufb01nite-dimensional counter-\nparts. 
This justi\ufb01es practical computations. Finally, we demonstrate our framework\nthrough experimental validation on GP datasets representing brain connectivity and\nclimate development. A MATLAB library for relevant computations will be pub-\nlished at https://sites.google.com/view/antonmallasto/software.\n\n1\n\nIntroduction\n\nGaussian processes (GPs, see Fig. 1) are the\ncounterparts of Gaussian distributions (GDs)\nover functions, making GPs natural objects to\nmodel uncertainty in estimated functions. With\nthe rise of GP modelling and probabilistic nu-\nmerics, GPs are increasingly used to model un-\ncertainty in function-valued data such as seg-\nmentation boundaries [17, 19, 30], image regis-\ntration [38] or time series [28]. Centered GPs, or\ncovariance operators, appear as image features\nin computer vision [12,16,25,26] and as features\nof phonetic language structure [23]. A natural\nnext step is therefore to analyze populations of\nGPs, where performance depends crucially on\nproper incorporation of inherent uncertainty or\nvariation. This paper contributes a principled\nframework for population analysis of GPs based on Wasserstein, a.k.a. earth mover\u2019s, distances.\nThe importance of incorporating uncertainty into population analysis is emphasized by the example\nin Fig. 2, where each data point is a GP representing the minimal temperature in the Siberian city\nVanavara over the course of one year [9, 34]. A na\u00efve way to compute its average temperature curve\nis to compute the per-day mean and standard deviation of the yearly GP mean curves. This is shown\nin the bottom right plot, and it is clear that the temperature variation is grossly underestimated,\nespecially in the summer season. 
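In the simplest, one-dimensional case the difference between the two averaging schemes is easy to see numerically. Below is a toy sketch (our illustration with made-up numbers, not the paper's data or code): for one-dimensional Gaussians N(m_i, s_i^2) with equal weights, the 2-Wasserstein barycenter has a known closed form that averages both the means and the standard deviations, whereas the naive approach only sees the spread of the mean values and discards each Gaussian's own variance.

```python
import numpy as np

# Toy version of the temperature example (synthetic numbers): each "year" is a
# 1-D Gaussian N(m_i, s_i^2) describing one day's temperature distribution.
means = np.array([-12.0, -10.0, -11.0, -13.0])
stds = np.array([4.0, 5.0, 6.0, 5.0])

# Naive approach: statistics of the mean values only; per-GP variance is lost.
naive_mean = means.mean()
naive_std = means.std()

# 2-Wasserstein barycenter of 1-D Gaussians with equal weights: average the
# means AND the standard deviations (the 1-D case of the barycenter equation).
bary_mean = means.mean()
bary_std = stds.mean()

print(naive_std, bary_std)  # the naive spread ignores the within-year variation
```

Here the naive standard deviation is about 1.1, far below the within-year spread of 4-6 that the barycenter preserves, mirroring the underestimation visible in the bottom-right plot of Fig. 2.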
The top right \ufb01gure shows the mean GP obtained with our proposed\nframework, which preserves a far more accurate representation of the natural temperature variation.\n\nFigure 1: An illustration of a GP, with mean func-\ntion (in black) and con\ufb01dence bound (in grey). The\ncolorful curves are sample paths of this GP.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 2: Left: Example GPs describing the daily minimum temperatures in a Siberian city (see\nSec. 4). Right top: The mean GP temperature curve, computed as a Wasserstein barycenter. Note\nthat the inherent variability in the daily temperature is realistically preserved, in contrast with the\nna\u00efve approach. Right bottom: A na\u00efve estimation of the mean and standard deviation of the daily\ntemperature, obtained by taking the day-by-day mean and standard deviation of the temperature. All\n\ufb01gures show a 95% con\ufb01dence interval.\n\nWe propose analyzing populations of GPs by geometrizing the space of GPs through the Wasserstein\ndistance, which yields a metric between probability measures with rich geometric properties. We\ncontribute i) closed-form solutions for arbitrarily good approximation of the Wasserstein distance by\nshowing that the 2-Wasserstein distance between two \ufb01nite-dimensional GP representations converges\nto the 2-Wasserstein distance of the two GPs; and ii) a characterization of a non-degenerate barycenter\nof a population of GPs, and a proof that such a barycenter is unique, and can be approximated by its\n\ufb01nite-dimensional counterpart.\nWe evaluate the Wasserstein distance in two applications. First, we illustrate the use of the Wasserstein\ndistance for processing of uncertain white-matter trajectories in the brain segmented from noisy\ndiffusion-weighted imaging (DWI) data using tractography. 
It is well known that the noise level and the low resolution of DWI images result in unreliable trajectories (tracts) [24]. This is problematic, as the estimated tracts are used, e.g., for surgical planning [8]. Recent work [17, 30] utilizes probabilistic numerics [29] to return uncertain tracts represented as GPs. We utilize the Wasserstein distance to incorporate the estimated uncertainty into typical DWI analysis tools such as tract clustering [37] and visualization. Our second study quantifies recent climate development based on data from Russian meteorological stations, using permutation testing on population barycenters, and provides interpretability of the climate development via GP-valued kernel regression.

Related work. Multiple frameworks exist for comparing Gaussian distributions (GDs) represented by their covariance matrices, including the Frobenius, Fisher-Rao (affine-invariant), log-Euclidean and Wasserstein metrics. Particularly relevant to our work is the 2-Wasserstein metric on GDs, whose Riemannian geometry is studied in [33], and whose barycenters are well understood [1, 4].

A body of work exists on generalizing the aforementioned metrics to infinite-dimensional covariance operators. As pointed out in [23], extending the affine-invariant and log-Euclidean metrics is problematic, as covariance operators are not compatible with logarithmic maps and their inverses are unbounded. These problems are avoided in [25, 26] by regularizing the covariance operators, but unfortunately, this also alters the data in a non-unique way. The Procrustes metric from [23] avoids this but, as it stands, only defines a metric between covariance operators.

The 2-Wasserstein metric, on the other hand, generalizes naturally from GDs to GPs, does not require regularization, and can be arbitrarily well approximated by a closed-form expression, making the computations cheap.
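In the finite-dimensional case this closed-form expression is straightforward to evaluate. The following is a minimal sketch (ours, in Python with NumPy/SciPy, not the authors' MATLAB library) of the well-known Gaussian formula: the squared 2-Wasserstein distance between N(m1, K1) and N(m2, K2) is the squared distance between the means plus Tr(K1 + K2 − 2(K1^{1/2} K2 K1^{1/2})^{1/2}).

```python
import numpy as np
from scipy.linalg import sqrtm

def wasserstein2_gaussians(m1, K1, m2, K2):
    """Squared 2-Wasserstein distance between N(m1, K1) and N(m2, K2)."""
    s1 = sqrtm(K1)                    # K1^{1/2}
    cross = sqrtm(s1 @ K2 @ s1)       # (K1^{1/2} K2 K1^{1/2})^{1/2}
    # sqrtm can introduce tiny imaginary round-off; keep the real part.
    cross = np.real(cross)
    return np.sum((m1 - m2) ** 2) + np.trace(K1 + K2 - 2.0 * cross)
```

For GPs, m_i and K_i would be the mean vector and covariance matrix of a finite restriction to grid points; the convergence result below (Theorem 8) is what justifies using this matrix formula as an approximation of the infinite-dimensional distance.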
Moreover, the theory of optimal transport [5, 6, 36] shows that the Wasserstein metric yields a rich geometry, which is further demonstrated by the previous work on GDs [33]. After this work was presented at NIPS, a preprint appeared [20] which also studies convergence results and barycenters of GPs in the Wasserstein geometry, in a more general setting.

Structure. Prior to introducing the Wasserstein distance between GPs, we review GPs, their Hilbert space covariance operators and the corresponding Gaussian measures in Sec. 2. In Sec. 3 we introduce the Wasserstein metric and its barycenters for GPs, and prove convergence properties of the metric and barycenters when GPs are approximated by finite-dimensional GDs. Experimental validation is found in Sec. 4, followed by discussion and conclusion in Sec. 5.

2 Prerequisites

Gaussian processes and measures. A Gaussian process (GP) f is a collection of random variables such that any finite restriction of its values (f(xi))_{i=1}^N has a joint Gaussian distribution, where xi ∈ X and X is the index set. A GP is entirely characterized by the pair

m(x) = E[f(x)], k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))], (1)

where m and k are called the mean function and covariance function, respectively. We use the notation f ∼ GP(m, k) for a GP f with mean function m and covariance function k. It follows from the definition that the covariance function k is symmetric and positive semidefinite. We say that f is non-degenerate if k is strictly positive definite. Throughout, we assume the GPs used are non-degenerate.

GPs relate closely to Gaussian measures on Hilbert spaces. Given probability spaces (X, Σ_X, µ) and (Y, Σ_Y, ν), we say that the measure ν is the push-forward of µ if ν(A) = µ(T⁻¹(A)) for a measurable T : X → Y and any A ∈ Σ_Y. Denote this by T#µ = ν.
A Borel measure µ on a separable Hilbert space H is a Gaussian measure if its push-forward with respect to any non-zero continuous element of the dual space of H is a non-degenerate Gaussian measure on R (i.e., the push-forward gives a univariate Gaussian distribution). A Borel-measurable set B is a Gaussian null set if µ(B) = 0 for any Gaussian measure µ on X. A measure ν on H is regular if ν(B) = 0 for any Gaussian null set B. Note that regular Gaussian measures correspond to non-degenerate GPs.

Covariance operators. Denote by L²(X) the space of L²-integrable functions from X to R. The covariance function k has an associated integral operator K : L²(X) → L²(X) defined by

[Kφ](x) = ∫_X k(x, s)φ(s) ds, ∀φ ∈ L²(X), (2)

called the covariance operator associated with k. As a by-product of the 2-Wasserstein metric on centered GPs, we get a metric on covariance operators. The operator K is Hilbert-Schmidt, self-adjoint, compact, positive, and of trace class, and the space of such covariance operators is a convex space. Furthermore, the assignment k ↦ K from L²(X × X) is an isometric isomorphism onto the space of Hilbert-Schmidt operators on L²(X) [7, Prop. 2.8.6]. This justifies us writing both f ∼ GP(m, K) and f ∼ GP(m, k).

Trace of an operator. The Wasserstein distance between GPs admits an analytical formula using traces of their covariance operators, as we will see below. Let (H, ⟨·,·⟩) be a separable Hilbert space with orthonormal basis {ek}_{k=1}^∞. Then the trace of a bounded linear operator T on H is given by

Tr T := Σ_{k=1}^∞ ⟨T ek, ek⟩, (3)

which is absolutely convergent and independent of the choice of basis if Tr(T*T)^{1/2} < ∞, where T* denotes the adjoint operator of T and T^{1/2} is the square root of T. In this case T is called a trace class operator.
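The operator in Eq. (2) and the trace in Eq. (3) both become concrete after discretization. The following sketch is our illustration (a uniform grid on X = [0, 1] with a squared-exponential kernel chosen only as an example, not taken from the paper): the integral operator becomes the matrix K[i, j] = k(x_i, x_j)·Δs, and for this positive self-adjoint matrix the trace is the sum of the eigenvalues.

```python
import numpy as np

# Discretize the covariance operator [Kφ](x) = ∫_X k(x, s)φ(s) ds on a grid.
n = 200
xs = np.linspace(0.0, 1.0, n)
ds = xs[1] - xs[0]

def k(x, s, ell=0.2):
    # Squared-exponential covariance function: symmetric, strictly positive definite.
    return np.exp(-0.5 * (x - s) ** 2 / ell ** 2)

# Matrix acting on function values (φ(x_1), ..., φ(x_n)); the factor ds is the
# quadrature weight of the integral.
K = k(xs[:, None], xs[None, :]) * ds

# K is symmetric and positive; its trace equals the sum of its eigenvalues and
# approximates ∫_X k(x, x) dx = 1 on this grid.
eigvals = np.linalg.eigvalsh(K)
```

Refining the grid makes Tr K converge to the integral of k along the diagonal, a finite-dimensional shadow of the trace-class property used throughout the paper.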
For positive self-adjoint operators, the trace is the sum of their eigenvalues.

The Wasserstein metric. The Wasserstein metric on probability measures derives from the optimal transport problem introduced by Monge and made rigorous by Kantorovich. The p-Wasserstein distance describes the minimal cost of transporting the unit mass of one probability measure into the unit mass of another probability measure, when the cost is given by an L^p distance [5, 6, 36].

Let (M, d) be a Polish space (complete and separable metric space) and denote by P_p(M) the set of all probability measures µ on M satisfying ∫_M d^p(x, x0) dµ(x) < ∞ for some x0 ∈ M. The p-Wasserstein distance between two probability measures µ, ν ∈ P_p(M) is given by

W_p(µ, ν) = inf_{γ ∈ Γ[µ,ν]} ( ∫_{M×M} d^p(x1, x2) dγ(x1, x2) )^{1/p}, (x1, x2) ∈ M × M, (4)

where Γ[µ, ν] is the set of joint measures on M × M with marginals µ and ν. Defined as above, W_p satisfies the properties of a metric. Furthermore, a minimizer in (4) is always achieved.

3 The Wasserstein metric for GPs

We will now study the Wasserstein metric with p = 2 between GPs. For GDs, this has been studied in [11, 14, 18, 22, 33].

From now on, assume that all GPs f ∼ GP(m, k) are indexed over a compact X ⊂ R^n, so that H := L²(X) is separable. Furthermore, we assume m ∈ L²(X) and k ∈ L²(X × X), so that observations of f live almost surely in H. Let f1 ∼ GP(m1, k1) and f2 ∼ GP(m2, k2) be GPs with associated covariance operators K1 and K2, respectively.
As the sample paths of f1 and f2 are in H, they induce Gaussian measures µ1, µ2 ∈ P2(H) on H, as there is a 1-1 correspondence between GPs having sample paths almost surely in L²(X) and Gaussian measures on L²(X) [27]. The 2-Wasserstein metric between the Gaussian measures µ1, µ2 is given by [13]

W_2^2(µ1, µ2) = d_2^2(m1, m2) + Tr(K1 + K2 − 2(K1^{1/2} K2 K1^{1/2})^{1/2}), (5)

where d2 is the canonical metric on L²(X). Using this, we get the following definition:

Definition 1. Let f1, f2 be GPs as above, and let the induced Gaussian measures of f1 and f2 be µ1 and µ2, respectively. Then their squared 2-Wasserstein distance is given by

W_2^2(f1, f2) := W_2^2(µ1, µ2) = d_2^2(m1, m2) + Tr(K1 + K2 − 2(K1^{1/2} K2 K1^{1/2})^{1/2}).

Remark 2. Note that the case m1 = m2 = 0 defines a metric for the covariance operators K1, K2, as (5) shows that the space of GPs is isometric to the Cartesian product of L²(X) and the covariance operators. We will denote this metric by W_2^2(K1, K2). Furthermore, as GDs are just a subset of GPs, W_2^2 also yields the 2-Wasserstein metric between GDs studied in [11, 14, 18, 22, 33].

Barycenters of Gaussian processes. Next, we define and study barycenters of populations of GPs, in a similar fashion as the GD case in [1]. Given a population {µi}_{i=1}^N ⊂ P2(H) and weights {ξi ≥ 0}_{i=1}^N with Σ_{i=1}^N ξi = 1, where H is a separable Hilbert space, the solution µ̄ of the problem

(P) inf_{µ ∈ P2(H)} Σ_{i=1}^N ξi W_2^2(µi, µ)

is the barycenter of the population {µi}_{i=1}^N with barycentric coordinates {ξi}_{i=1}^N. The barycenter for GPs is defined to be the barycenter of the associated Gaussian measures.

Remark 3.
The following theorems require the assumption that the barycenter is non-degenerate; it is still a conjecture that the barycenter of non-degenerate GPs is non-degenerate [20], but this holds in the finite-dimensional case of GDs.

We now state the main theorem of this section, which follows from Prop. 5 and Prop. 6 below.

Theorem 4. Let {fi}_{i=1}^N be a population of GPs with fi ∼ GP(mi, Ki). Then there exists a unique barycenter f̄ ∼ GP(m̄, K̄) with barycentric coordinates (ξi)_{i=1}^N. If f̄ is non-degenerate, then m̄ and K̄ satisfy

m̄ = Σ_{i=1}^N ξi mi, Σ_{i=1}^N ξi (K̄^{1/2} Ki K̄^{1/2})^{1/2} = K̄.

Proposition 5. Let {µi}_{i=1}^N ⊂ P2(H) and µ̄ be a barycenter with barycentric coordinates (ξi)_{i=1}^N. Assume µi is regular for some i; then µ̄ is the unique minimizer of (P).

Proof. We first show that the map ν ↦ W_2^2(µ, ν) is convex, and strictly convex if µ is a regular measure. To see this, let νi ∈ P2(H) and γ*_i ∈ Γ[µ, νi] be the optimal transport plans between µ and νi for i = 1, 2. Then λγ*_1 + (1 − λ)γ*_2 ∈ Γ[µ, λν1 + (1 − λ)ν2] for λ ∈ [0, 1]. Therefore

W_2^2(µ, λν1 + (1 − λ)ν2) = inf_{γ ∈ Γ[µ, λν1 + (1−λ)ν2]} ∫_{H×H} d²(x, y) dγ
≤ ∫_{H×H} d²(x, y) d(λγ*_1 + (1 − λ)γ*_2)
= λ W_2^2(µ, ν1) + (1 − λ) W_2^2(µ, ν2),

which gives convexity.
Note that for λ ∈ ]0, 1[, the transport plan λγ*_1 + (1 − λ)γ*_2 splits mass. Therefore it cannot be the unique optimal plan between µ and λν1 + (1 − λ)ν2. As µ is regular, the optimal plan does not split mass, as it is induced by a map [3, Thm. 6.2.10], so we have strict convexity. From this follows the strict convexity of the objective function in (P).

Next we characterize the barycenter, assuming it is non-degenerate, in the spirit of the finite-dimensional case in [1, Thm. 6.1].

Proposition 6. Let {fi}_{i=1}^N be a population of centered GPs, fi ∼ GP(0, Ki). Then (P) has a unique solution f̄ ∼ GP(0, K̄). If f̄ is non-degenerate, then K̄ is the unique bounded self-adjoint positive linear operator satisfying

Σ_{i=1}^N ξi (K^{1/2} Ki K^{1/2})^{1/2} = K. (6)

Proof. Existence can be shown following the proof for the finite-dimensional case [1, Prop. 4.2], which uses multimarginal optimal transport; this appears in the preprint [20, Cor. 9]. For the characterization, let

BC(f) = Σ_{i=1}^N ξi W_2^2(fi, f)

be the barycentric expression, and assume that the minimizer f̄ of BC is non-degenerate. Let 0 < λ1, λ2, ... be the eigenvalues of K̄ with eigenfunctions e1, e2, .... Then, by [10, Prop. 2.2], the transport map between f̄ and fk is given by

Tk(x) = Σ_{i=1}^∞ Σ_{j=1}^∞ ( ⟨x, ej⟩ ⟨(K̄^{1/2} Kk K̄^{1/2})^{1/2} ej, ei⟩ / (λi^{1/2} λj^{1/2}) ) ei(x). (7)

Using [6, Thm. 8.4.7], we can write the gradient of the barycentric expression.
We furthermore know that the expression is strictly convex; thus the gradient at f̄ equals zero if and only if f̄ is the minimizer. Now let Id be the identity operator. Then

∇BC(f̄) = Σ_{k=1}^N ξk (Tk − Id) = 0,

and substituting in (7), we get

Σ_{i=1}^N ξi (K^{1/2} Ki K^{1/2})^{1/2} = K.

Proof of Theorem 4. Use Prop. 6, the properties of a barycenter in a Hilbert space, and that the space of GPs is isometric to the Cartesian product of L²(X) and the covariance operators.

Remark 7. For the practical computation of barycenters of GDs approximating GPs, to be discussed below, a fixed-point iteration scheme with a guarantee of convergence exists [4, Thm. 4.2].

Convergence properties. Now we show that the 2-Wasserstein metric for GPs can be approximated arbitrarily well by the 2-Wasserstein metric for GDs. This is important, as in real life we observe finite-dimensional representations of the covariance operators.

Let {ei}_{i=1}^∞ be an orthonormal basis for L²(X). Then we define the GDs given by the restrictions min and Kin of mi and Ki, i = 1, 2, on Vn = span(e1, ..., en) by

min(x) = Σ_{k=1}^n ⟨mi, ek⟩ ek(x), Kin φ = Σ_{k=1}^n ⟨φ, ek⟩ Ki ek, ∀φ ∈ Vn, ∀x ∈ X, (8)

and prove the following:

Theorem 8. The 2-Wasserstein metric between GDs on finite samples converges to the Wasserstein metric between GPs; that is, if fin ∼ N(min, Kin) and fi ∼ GP(mi, Ki) for i = 1, 2, then

lim_{n→∞} W_2^2(f1n, f2n) = W_2^2(f1, f2).

By the same argument, it also follows that W_2^2(·,·) is continuous in both arguments in the operator norm topology.

Proof. Kin → Ki in operator norm as n → ∞.
Because taking sums, products and square roots of operators is continuous with respect to the operator norm, it follows that

K1n + K2n − 2(K1n^{1/2} K2n K1n^{1/2})^{1/2} → K1 + K2 − 2(K1^{1/2} K2 K1^{1/2})^{1/2}.

Note that for any sequence An → A with convergence in operator norm, we have

|Tr A − Tr An| ≤ Σ_{k=1}^∞ |⟨(A − An)ek, ek⟩| ≤ (Cauchy-Schwarz) Σ_{k=1}^∞ ‖(A − An)ek‖_{L²} → (MCT) 0, (9)

as lim_{n→∞} sup_{v ∈ L²(X), ‖v‖=1} ‖(A − An)v‖_{L²} = 0 due to the convergence in operator norm. Here MCT stands for the monotone convergence theorem. Thus we have

W_2^2(f1n, f2n) = d_2^2(m1n, m2n) + Tr(K1n + K2n − 2(K1n^{1/2} K2n K1n^{1/2})^{1/2})
→ d_2^2(m1, m2) + Tr(K1 + K2 − 2(K1^{1/2} K2 K1^{1/2})^{1/2}) as n → ∞
= W_2^2(f1, f2).

The importance of Theorem 8 is that it justifies computing distances using finite representations of GPs as approximations of the infinite-dimensional case.

Next, assuming the barycenter is non-degenerate, we show that we can also approximate the barycenter of a population of GPs by computing the barycenters of populations of GDs converging to these GPs. For the degenerate case, see [20, Thm. 11].

Theorem 9. Assuming the barycenter of a population of GPs is non-degenerate, it varies continuously; that is, the map (f1, ..., fN) ↦ f̄ is continuous in the operator norm. In particular, this implies that the barycenter f̄n of the finite-dimensional restrictions {fin}_{i=1}^N converges to f̄.

First, we show that if fi ∼ GP(mi, Ki) and f̄ ∼ GP(m̄, K̄), then the map (K1, ..., KN) ↦ K̄ is continuous.
Continuity of (m1, ..., mN) ↦ m̄ is clear.

Let K be a covariance operator and denote its maximal eigenvalue by λmax(K). Note that this map is well-defined, as K is a bounded, normal operator and thus λmax(K) = ‖K‖_op < ∞. Now let a = (K1, ..., KN) be a population of covariance operators, denote its ith element by a(i) = Ki, and define the continuous function β and the correspondence (a set-valued map) Φ as follows:

β : a ↦ ( Σ_{i=1}^N ξi √(λmax(a(i))) )², Φ : a ↦ K_{β(a)} = {K ∈ HS(H) | β(a)I ≥ K ≥ 0}.

Then the fixed point of (6) can be found in Φ(a), as the map

F(K) = Σ_{i=1}^N ξi (K^{1/2} Ki K^{1/2})^{1/2}

is a compact operator, Φ(a) is bounded, and so the closure of F(Φ(a)) is compact. Furthermore, note that F is a map from Φ(a) to itself, so by Schauder's fixed point theorem, there exists a fixed point.

Now, we want to show that this correspondence is continuous in order to apply the Maximum theorem. A correspondence Φ : A → B is upper hemi-continuous at a ∈ A if all convergent sequences (an) ∈ A, (bn) ∈ Φ(an) with lim_{n→∞} an = a and lim_{n→∞} bn = b satisfy b ∈ Φ(a). The correspondence is lower hemi-continuous at a ∈ A if for all convergent sequences an → a in A and any b ∈ Φ(a), there is a subsequence (a_{n_k}) and a sequence bk ∈ Φ(a_{n_k}) satisfying bk → b. If the correspondence is both upper and lower hemi-continuous, we say that it is continuous. For more about the Maximum theorem and hemi-continuity, see [2].

Lemma 10. The correspondence Φ : a ↦ K_{β(a)} is continuous as a correspondence.

Proof.
First, we show the correspondence is lower hemi-continuous. Let (an)_{n=1}^∞ be a sequence of populations of covariance operators of size N that converges, an → a. Use the shorthand notation βn := β(an), so that βn → β∞ := β(a), and let b ∈ Φ(a) = K_{β∞}.

Pick a subsequence (a_{n_k})_{k=1}^∞ so that (β_{n_k})_{k=1}^∞ is increasing or decreasing. If it is decreasing, then K_{β∞} ⊆ K_{β_{n_k}} for every n_k, and the proof is finished by choosing bk = b for every k. Hence assume the sequence is increasing, so that K_{β_{n_k}} ⊆ K_{β_{n_{k+1}}}. Now let γ(t) = (1 − t)b1 + tb, where b1 ∈ K_{β_1}, and let t_{n_k} be the solution to (1 − t)β1 + tβ∞ = β_{n_k}; then bk := γ(t_{n_k}) ∈ K_{β_{n_k}} and bk → b.

For upper hemi-continuity, assume that an → a, bn ∈ K_{βn} and bn → b. Then, using the definition of Φ, we get the positive sequence ⟨(βn I − bn)x, x⟩ ≥ 0 indexed by n, and by continuity and the positivity of this sequence it follows that

0 ≤ lim_{n→∞} ⟨(βn I − bn)x, x⟩ = ⟨(β∞ I − b)x, x⟩.

One can check the criterion b ≥ 0 similarly, and so we are done.

Proof of Theorem 9. Let a = (K1, ..., KN), f(K, a) := Σ_{i=1}^N ξi W_2^2(K, Ki) and F(K) := Σ_{i=1}^N ξi (K^{1/2} Ki K^{1/2})^{1/2}; then the unique minimizer K̄ of f is the fixed point of F. Furthermore, the closure cl(F(K_{β(a)})) is compact, and a ↦ cl(F(K_{β(a)})) is a continuous correspondence, as the closure of the composition of two continuous correspondences. Additionally, we know that K̄ ∈ cl(F(K_{β(a)})), so applying the Maximum theorem, we have shown that the barycenter of a population of covariance operators varies continuously, i.e.
the map (K1, ..., KN) ↦ K̄ is continuous, finishing the proof.

4 Experiments

We illustrate the utility of the Wasserstein metric in two different applications: processing of uncertain white-matter tracts estimated from DWI, and analysis of climate development via temperature curve GPs.

Experimental setup. The white-matter tract GPs are estimated for a single subject from the Human Connectome Project [15, 32, 35], using probabilistic shortest-path tractography [17]. See the supplementary material for details on the data and its preprocessing. From daily minimum temperatures measured at a set of 30 randomly sampled Russian meteorological stations [9, 34], GP regression was used to estimate a GP temperature curve per year and station for the period 1940-2009 using maximum likelihood parameters. All code for computing Wasserstein distances and barycenters was implemented in MATLAB and run on a laptop with a 2.7 GHz Intel Core i5 processor and 8 GB of 1867 MHz DDR3 memory. On the temperature GP curves (represented by 50 samples), the average runtime of the 2-Wasserstein distance computation was 0.048 ± 0.014 seconds (estimated from 1000 pairwise distance computations), and the average runtime of the 2-Wasserstein barycenter of a sample of size 10 was 0.69 ± 0.11 seconds (estimated from 200 samples).

White-matter tract processing. The inferior longitudinal fasciculus is a white-matter bundle which splits into two separate bundles. Fig. 3 (top) shows the results of agglomerative hierarchical clustering of the GP tracts using average Wasserstein distance. The per-cluster Wasserstein barycenter can be used to represent the tracts; its overlap with the individual GP mean curves is shown in Fig.
3\n(bottom).\nThe individual GP tracts are visualized via their mean curves, but they are in fact a population of GPs.\nTo con\ufb01rm that the two clusters are indeed different also when the covariance function is taken into\naccount, we perform a permutation test for difference between per-cluster Wasserstein barycenters,\nand already with 50 permutations we observe a p-value of p = 0.0196, con\ufb01rming that the two\nclusters are signi\ufb01cantly different at a 5% signi\ufb01cance level.\n\nQuantifying climate change. Using the Wasserstein\nbarycenters we perform nonparametric kernel regression to\nvisualize how yearly temperature curves evolve with time,\nbased on the Russian yearly temperature GPs. Fig. 4 shows\nsnapshots from this evolution, and a continuous movie ver-\nsion climate.avi is found in the supplementary material.\nThe regressed evolution indicates an increase in overall\ntemperature as we reach the \ufb01nal year 2009. To quan-\ntify this observation, we perform a permutation test using\nthe Wasserstein distance between population Wasserstein\nbarycenters to compare the \ufb01nal 10 years 2000-2009 with\nthe years 1940-1999. Using 50 permutations we obtain a\np-value of 0.0392, giving signi\ufb01cant difference in temper-\nature curves at a 95% con\ufb01dence level.\n\nSigni\ufb01cance. Note that the state-of-the-art in tract anal-\nysis as well as in functional data analysis would be to\nignore the covariance of the estimated curves and treat\nthe mean curves as observations. We contribute a frame-\nwork to incorporate the uncertainty into the population\nanalysis \u2013 but why would we want to retain uncertainty?\nIn the white-matter tracts, the GP covariance represents\nspatial uncertainty in the estimated curve trajectory. The\nindividual GPs represent connections between different\nendpoints. Thus, they do not represent observations of\nthe exact same trajectory, but rather of distinct, nearby\ntrajectories. 
It is common in diffusion MRI to represent\nsuch sets of estimated trajectories by a few prototype tra-\njectories for visualization and comparative analysis; we obtain prototypes through the Wasserstein\nbarycenter. To correctly interpret the spatial uncertainty, e.g. for a brain surgeon [8], it is crucial\nthat the covariance of the prototype GP represents the covariances of the individual GPs, and not\nsmaller. If you wanted to reduce uncertainty by increasing sample size, you would need more images,\nnot more curves \u2013 because the noise is in the image. But more images are not usually available. In\nthe climate data, the GP covariance models natural temperature variation, not measurement noise.\nIncreasing the sample size decreases the error of the temperature distribution, but should not decrease\nthis natural variation (i.e. the covariance).\n\nFigure 3: Top: The mean functions of\nthe individual GPs, colored by cluster\nmembership, in the context of the corre-\nsponding T1-weighted MRI slices. Bot-\ntom: The tract GP mean functions and\nthe cluster mean GPs with 95% con\ufb01-\ndence bounds.\n\n5 Discussion and future work\n\nWe have shown that the Wasserstein metric for GPs is both theoretically and computationally well-\nfounded for statistics on GPs: It de\ufb01nes unique barycenters, and allows ef\ufb01cient computations\nthrough \ufb01nite-dimensional representations. We have illustrated its use in two different applications:\nProcessing of uncertain estimates of white-matter trajectories in the brain, and analysis of climate\ndevelopment via GP representations of temperature curves. We have seen that the metric itself is\ndiscriminative for clustering and permutation testing, and we have seen how the GP barycenters allow\ntruthful interpretation of uncertainty in the white matter tracts and of variation in the temperature\ncurves.\n\n8\n\n\fFigure 4: Snapshots from the kernel regression giving yearly temperature curves 1940-2009. 
We observe an apparent temperature increase, which is confirmed by the permutation test.

Future work includes more complex learning algorithms, starting with preprocessing tools such as PCA [31] and moving on to supervised predictive models. This calls for a better understanding of the potentially Riemannian structure of the infinite-dimensional Wasserstein space, which would enable us to draw on existing results for learning with manifold-valued data [21].

The Wasserstein distance allows the inherent uncertainty in the estimated GP data points to be appropriately accounted for in every step of the analysis, giving a truthful analysis and subsequent interpretation. This is particularly important in applications where uncertainty or variation is crucial: variation in temperature is an important feature of climate change, and while estimated white-matter trajectories are known to be unreliable, they are used in surgical planning, making the uncertainty of their trajectories a highly relevant parameter.

6 Acknowledgements

This research was supported by the Centre for Stochastic Geometry and Advanced Bioimaging, funded by a grant from the Villum Foundation. Data were provided [in part] by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University. The authors would also like to thank Mads Nielsen for valuable discussions and supervision. Finally, the authors would like to thank Victor Panaretos for valuable discussions and, in particular, for pointing out an error in an earlier version of the manuscript.

References

[1] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

[2] C. Aliprantis and K. Border. Infinite dimensional analysis: a hitchhiker's guide. Studies in Economic Theory, 4, 1999.

[3] P. Álvarez-Esteban, E. Del Barrio, J. Cuesta-Albertos, C. Matrán, et al. Uniqueness and approximate computation of optimal incomplete transportation plans. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 47, pages 358–375. Institut Henri Poincaré, 2011.

[4] P. C. Álvarez-Esteban, E. del Barrio, J. Cuesta-Albertos, and C. Matrán. A fixed-point approach to barycenters in Wasserstein space. Journal of Mathematical Analysis and Applications, 441(2):744–762, 2016.

[5] L. Ambrosio and N. Gigli. A user's guide to optimal transport. In Modelling and optimisation of flows on networks, pages 1–155. Springer, 2013.

[6] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

[7] W. Arveson. A short course on spectral theory, volume 209. Springer Science & Business Media, 2006.

[8] J. Berman. Diffusion MR tractography as a tool for surgical planning. Magnetic Resonance Imaging Clinics of North America, 17(2):205–214, 2009.

[9] O. Bulygina and V. Razuvaev. Daily temperature and precipitation data for 518 Russian meteorological stations. Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, US Department of Energy, Oak Ridge, Tennessee, 2012.

[10] J. Cuesta-Albertos, C. Matrán-Bea, and A. Tuero-Diaz. On lower bounds for the L2-Wasserstein metric in a Hilbert space. Journal of Theoretical Probability, 9(2):263–283, 1996.

[11] D. Dowson and B. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.

[12] M. Faraki, M. T. Harandi, and F. Porikli.
Approximate infinite-dimensional region covariance descriptors for image classification. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1364–1368. IEEE, 2015.

[13] M. Gelbrich. On a formula for the L2 Wasserstein metric between measures on Euclidean and Hilbert spaces. Mathematische Nachrichten, 147(1):185–203, 1990.

[14] C. R. Givens, R. M. Shortt, et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.

[15] M. F. Glasser, S. N. Sotiropoulos, J. A. Wilson, T. S. Coalson, B. Fischl, J. L. Andersson, J. Xu, S. Jbabdi, M. Webster, J. R. Polimeni, et al. The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage, 80:105–124, 2013.

[16] M. Harandi, M. Salzmann, and F. Porikli. Bregman divergences for infinite dimensional covariance matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1003–1010, 2014.

[17] S. Hauberg, M. Schober, M. Liptrot, P. Hennig, and A. Feragen. A random Riemannian metric for probabilistic shortest-path tractography. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 597–604. Springer, 2015.

[18] M. Knott and C. S. Smith. On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43(1):39–49, 1984.

[19] M. Lê, J. Unkelbach, N. Ayache, and H. Delingette. GPSSI: Gaussian process for sampling segmentations of images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 38–46. Springer, 2015.

[20] V. Masarotto, V. M. Panaretos, and Y. Zemel. Procrustes metrics on covariance operators and optimal transportation of Gaussian processes. arXiv preprint arXiv:1801.01990, 2018.

[21] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst.
Geodesic convolutional neural networks on Riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 37–45, 2015.

[22] I. Olkin and F. Pukelsheim. The distance between two random vectors with given dispersion matrices. Linear Algebra and its Applications, 48:257–263, 1982.

[23] D. Pigoli, J. A. Aston, I. L. Dryden, and P. Secchi. Distances and inference for covariance operators. Biometrika, 101(2):409–422, 2014.

[24] S. Pujol, W. Wells, C. Pierpaoli, C. Brun, J. Gee, G. Cheng, B. Vemuri, O. Commowick, S. Prima, A. Stamm, et al. The DTI challenge: toward standardized evaluation of diffusion tensor imaging tractography for neurosurgery. Journal of Neuroimaging, 25(6):875–882, 2015.

[25] M. H. Quang and V. Murino. From covariance matrices to covariance operators: Data representation from finite to infinite-dimensional settings. In Algorithmic Advances in Riemannian Geometry and Applications, pages 115–143. Springer, 2016.

[26] M. H. Quang, M. San Biagio, and V. Murino. Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces. In Advances in Neural Information Processing Systems, pages 388–396, 2014.

[27] B. S. Rajput. Gaussian measures on Lp spaces, 1 ≤ p < ∞. Journal of Multivariate Analysis, 2(4):382–403, 1972.

[28] S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian processes for time-series modelling. Phil. Trans. R. Soc. A, 371(1984):20110550, 2013.

[29] M. Schober, D. K. Duvenaud, and P. Hennig. Probabilistic ODE solvers with Runge-Kutta means. In Advances in Neural Information Processing Systems, pages 739–747, 2014.

[30] M. Schober, N. Kasenburg, A. Feragen, P. Hennig, and S. Hauberg. Probabilistic shortest path tractography in DTI using Gaussian Process ODE solvers.
In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 265–272. Springer, 2014.

[31] V. Seguy and M. Cuturi. Principal geodesic analysis for probability measures under the optimal transport metric. In Advances in Neural Information Processing Systems, pages 3312–3320, 2015.

[32] S. Sotiropoulos, S. Moeller, S. Jbabdi, J. Xu, J. Andersson, E. Auerbach, E. Yacoub, D. Feinberg, K. Setsompop, L. Wald, et al. Effects of image reconstruction on fiber orientation mapping from multichannel diffusion MRI: reducing the noise floor using SENSE. Magnetic Resonance in Medicine, 70(6):1682–1689, 2013.

[33] A. Takatsu et al. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.

[34] R. Tatusko and J. A. Mirabito. Cooperation in climate research: An evaluation of the activities conducted under the US-USSR agreement for environmental protection since 1974. National Climate Program Office, 1990.

[35] D. C. Van Essen, S. M. Smith, D. M. Barch, T. E. Behrens, E. Yacoub, K. Ugurbil, W.-M. H. Consortium, et al. The WU-Minn Human Connectome Project: an overview. NeuroImage, 80:62–79, 2013.

[36] C. Villani. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.

[37] D. Wassermann, L. Bloy, E. Kanterakis, R. Verma, and R. Deriche. Unsupervised white matter fiber clustering and tract probability map generation: Applications of a Gaussian process framework for white matter fibers. NeuroImage, 51(1):228–241, 2010.

[38] X. Yang and M. Niethammer. Uncertainty quantification for LDDMM using a low-rank Hessian approximation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 289–296.
Springer, 2015.