{"title": "Minimax Statistical Learning with Wasserstein distances", "book": "Advances in Neural Information Processing Systems", "page_first": 2687, "page_last": 2696, "abstract": "As opposed to standard empirical risk minimization (ERM), distributionally robust optimization aims to minimize the worst-case risk over a larger ambiguity set containing the original empirical distribution of the training data. In this work, we describe a minimax framework for statistical learning with ambiguity sets given by balls in Wasserstein space. In particular, we prove generalization bounds that involve the covering number properties of the original ERM problem. As an illustrative example, we provide generalization guarantees for transport-based domain adaptation problems where the Wasserstein distance between the source and target domain distributions can be reliably estimated from unlabeled samples.", "full_text": "Minimax statistical learning\nwith Wasserstein distances\n\nJaeho Lee\n\nMaxim Raginsky\n\n{jlee620, maxim}@illinois.edu*\n\nAbstract\n\nAs opposed to standard empirical risk minimization (ERM), distributionally robust\noptimization aims to minimize the worst-case risk over a larger ambiguity set\ncontaining the original empirical distribution of the training data. In this work,\nwe describe a minimax framework for statistical learning with ambiguity sets\ngiven by balls in Wasserstein space. In particular, we prove generalization bounds\nthat involve the covering number properties of the original ERM problem. As\nan illustrative example, we provide generalization guarantees for transport-based\ndomain adaptation problems where the Wasserstein distance between the source\nand target domain distributions can be reliably estimated from unlabeled samples.\n\n1 Introduction\n\nIn the traditional paradigm of statistical learning [20], we have a class $\\mathcal{P}$ of probability measures on a measurable instance space $\\mathsf{Z}$ and a class $\\mathcal{F}$ of measurable functions $f : \\mathsf{Z} \\to \\mathbb{R}_+$.
Each $f \\in \\mathcal{F}$ quantifies the loss of some decision rule or a hypothesis applied to instances $z \\in \\mathsf{Z}$, so, with a slight abuse of terminology, we will refer to $\\mathcal{F}$ as the hypothesis space. The (expected) risk of a hypothesis $f$ on instances generated according to $P$ is given by\n\n$$R(P, f) := \\mathbb{E}_P[f(Z)] = \\int_{\\mathsf{Z}} f(z)\\, P(dz).$$\n\nGiven an $n$-tuple $Z_1, \\ldots, Z_n$ of i.i.d. training examples drawn from an unknown $P \\in \\mathcal{P}$, the objective is to find a hypothesis $\\widehat{f} \\in \\mathcal{F}$ whose risk $R(P, \\widehat{f})$ is close to the minimum risk\n\n$$R^*(P, \\mathcal{F}) := \\inf_{f \\in \\mathcal{F}} R(P, f) \\tag{1}$$\n\nwith high probability. Under suitable regularity assumptions, this objective can be accomplished via Empirical Risk Minimization (ERM) [20, 13]:\n\n$$R(P_n, f) = \\frac{1}{n} \\sum_{i=1}^n f(Z_i) \\to \\min, \\quad f \\in \\mathcal{F} \\tag{2}$$\n\nwhere $P_n := \\frac{1}{n} \\sum_{i=1}^n \\delta_{Z_i}$ is the empirical distribution of the training examples.\n\n*Department of Electrical and Computer Engineering and Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801, USA. This work was supported in part by NSF grant nos. CIF-1527388 and CIF-1302438, and in part by the NSF CAREER award 1254041.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nRecently, however, an alternative viewpoint has emerged, inspired by ideas from robust statistics and robust stochastic optimization. In this distributionally robust framework, instead of solving the ERM problem (2), one aims to solve the minimax problem\n\n$$\\sup_{Q \\in \\mathcal{A}(P_n)} R(Q, f) \\to \\min, \\quad f \\in \\mathcal{F} \\tag{3}$$\n\nwhere $\\mathcal{A}(P_n)$ is an ambiguity set containing the empirical distribution $P_n$ and, possibly, the unknown probability law $P$ either with high probability or almost surely. The ambiguity sets serve as a mechanism for compensating for the uncertainty about $P$ that inherently arises due to having only a finite number of samples to work with, and can be constructed in a variety of ways, e.g.
via moment constraints [9], $f$-divergence balls [8], and Wasserstein balls [16, 11, 5]. However, with the exception of the recent work by Farnia and Tse [9], the minimizer of (3) is still evaluated under the standard statistical risk minimization paradigm.\n\nIn this work, we instead study the scheme where the statistical risk minimization criterion (1) is replaced with the local minimax risk\n\n$$\\inf_{f \\in \\mathcal{F}} \\sup_{Q \\in \\mathcal{A}(P)} R(Q, f)$$\n\nat $P$, where the ambiguity set $\\mathcal{A}(P)$ is taken to be a Wasserstein ball centered at $P$. As we will argue below, this change of perspective is natural when there is a possibility of domain drift, i.e., when the learned hypothesis is evaluated on a distribution $Q$ which may be different from the distribution $P$ that was used to generate the training data.\n\nThe rest of this paper is organized as follows: In Section 2, we formally present the notion of local minimax risk and discuss its relationship to the statistical risk, which allows us to assess the performance of minimax-optimal hypotheses in specific domains. We also provide an example to illustrate the role of ambiguity sets in rejecting nonrobust hypotheses.\n\nIn Section 3, we show that the hypothesis learned with the Empirical Risk Minimization (ERM) procedure based on the local minimax risk closely achieves the optimal local minimax risk. In particular, we provide a data-dependent bound on the generalization error, which behaves like the bound for ordinary ERM in the no-ambiguity regime (Theorem 1), and excess risk bounds under uniform smoothness assumptions on $\\mathcal{F}$ (Theorem 2) and a less restrictive assumption that $\\mathcal{F}$ contains at least one smooth hypothesis (Theorem 3).\n\nIn Section 4, we provide an alternative perspective on domain adaptation based on the minimax statistical learning under the framework of Courty et al.
[6], where the domain drift is due to an unknown transformation of the feature space that preserves the conditional distribution of the labels given the features. Completely bypassing the estimation of the transport map, we provide a proper excess risk bound that compares the risk of the learned hypothesis to the minimal risk achievable within the given hypothesis class on the target domain (Theorem 4). To the best of our knowledge, all existing theoretical results on domain adaptation are stated in terms of the discrepancy between the best hypotheses on the source and on the target domains.\n\nAll proofs are deferred to the appendix.\n\n2 Local minimax risk with Wasserstein ambiguity sets\n\nWe assume that the instance space $\\mathsf{Z}$ is a Polish space (i.e., a complete separable metric space) with metric $d_{\\mathsf{Z}}$. We denote by $\\mathcal{P}(\\mathsf{Z})$ the space of all Borel probability measures on $\\mathsf{Z}$, and by $\\mathcal{P}_p(\\mathsf{Z})$ with $p \\geq 1$ the space of all $P \\in \\mathcal{P}(\\mathsf{Z})$ with finite $p$th moments. The metric structure of $\\mathsf{Z}$ can be used to define a family of metrics on the spaces $\\mathcal{P}_p(\\mathsf{Z})$ [21]:\n\nDefinition 1. For $p \\geq 1$, the $p$-Wasserstein distance between $P, Q \\in \\mathcal{P}_p(\\mathsf{Z})$ is\n\n$$W_p(P, Q) := \\inf_{\\substack{M(\\cdot \\times \\mathsf{Z}) = P \\\\ M(\\mathsf{Z} \\times \\cdot) = Q}} \\left( \\mathbb{E}_M\\left[ d_{\\mathsf{Z}}^p(Z, Z') \\right] \\right)^{1/p}, \\tag{4}$$\n\nwhere the infimum is taken over all couplings of $P$ and $Q$, i.e. probability measures $M$ on the product space $\\mathsf{Z} \\times \\mathsf{Z}$ with the given marginals $P$ and $Q$.\n\nRemark 1. Wasserstein distances arise in the problem of optimal transport: for any coupling $M$ of $P$ and $Q$, the conditional distribution $M_{Z'|Z}$ can be viewed as a randomized policy for \u2018transporting\u2019 a unit quantity of some material from a random location $Z \\sim P$ to another location $Z'$, while satisfying the marginal constraint $Z' \\sim Q$. If the cost of transporting a unit of material from $z \\in \\mathsf{Z}$ to $z' \\in \\mathsf{Z}$ is given by $d_{\\mathsf{Z}}^p(z, z')$, then $W_p^p(P, Q)$ is the minimum expected transport cost.\n\nWe now consider a learning problem $(\\mathcal{P}, \\mathcal{F})$ with $\\mathcal{P} = \\mathcal{P}_p(\\mathsf{Z})$ for some $p \\geq 1$.
Following [16, 17, 11], we let the ambiguity set $\\mathcal{A}(P)$ be the $p$-Wasserstein ball of radius $\\varrho \\geq 0$ centered at $P$:\n\n$$\\mathcal{A}(P) = B^W_{\\varrho,p}(P) := \\{ Q \\in \\mathcal{P}_p(\\mathsf{Z}) : W_p(P, Q) \\leq \\varrho \\},$$\n\nwhere the radius $\\varrho > 0$ is a tunable parameter. We then define the local worst-case risk of $f$ at $P$,\n\n$$R_{\\varrho,p}(P, f) := \\sup_{Q \\in B^W_{\\varrho,p}(P)} R(Q, f),$$\n\nand the local minimax risk at $P$:\n\n$$R^*_{\\varrho,p}(P, \\mathcal{F}) := \\inf_{f \\in \\mathcal{F}} R_{\\varrho,p}(P, f).$$\n\n2.1 Local worst-case risk vs. statistical risk\n\nWe give a couple of inequalities relating the local worst-case (or local minimax) risks and the usual statistical risks, which will be useful in Section 4. The first one is a simple consequence of the Kantorovich duality theorem from the theory of optimal transport [21]:\n\nProposition 1. Suppose that $f$ is $L$-Lipschitz, i.e., $|f(z) - f(z')| \\leq L d_{\\mathsf{Z}}(z, z')$ for all $z, z' \\in \\mathsf{Z}$. Then, for any $Q \\in B^W_{\\varrho,p}(P)$,\n\n$$R(Q, f) \\leq R_{\\varrho,p}(P, f) \\leq R(Q, f) + 2L\\varrho.$$\n\nAs an example, consider the problem of binary classification with hinge loss: $\\mathsf{Z} = \\mathsf{X} \\times \\mathsf{Y}$, where $\\mathsf{X}$ is an arbitrary feature space, $\\mathsf{Y} = \\{-1, +1\\}$, and the hypothesis space $\\mathcal{F}$ consists of all functions of the form $f(z) = f(x, y) = \\max\\{0, 1 - y f_0(x)\\}$, where $f_0 : \\mathsf{X} \\to \\mathbb{R}$ is a candidate predictor. Then, since the function $u \\mapsto \\max\\{0, 1 - u\\}$ is Lipschitz-continuous with constant 1, we can write\n\n$$|f(x, y) - f(x', y')| \\leq |y f_0(x) - y' f_0(x')| \\leq 2\\|f_0\\|_{\\mathsf{X}} \\mathbf{1}\\{y \\neq y'\\} + |f_0(x) - f_0(x')|,$$\n\nwhere $\\|f_0\\|_{\\mathsf{X}} := \\sup_{x \\in \\mathsf{X}} |f_0(x)|$. If $\\|f_0\\|_{\\mathsf{X}} < \\infty$ and if $f_0$ is $L_0$-Lipschitz with respect to some metric $d_{\\mathsf{X}}$ on $\\mathsf{X}$, then it follows that $f$ is Lipschitz with constant $\\max\\{2\\|f_0\\|_{\\mathsf{X}}, L_0\\}$ with respect to the product metric\n\n$$d_{\\mathsf{Z}}(z, z') = d_{\\mathsf{Z}}((x, y), (x', y')) := d_{\\mathsf{X}}(x, x') + \\mathbf{1}\\{y \\neq y'\\}.$$\n\nNext we consider the case when the function $f$ is smooth but not Lipschitz-continuous. Since we are working with general metric spaces that may lack an obvious differentiable structure, we need to first introduce some concepts from metric geometry [1].
A metric space $(\\mathsf{Z}, d_{\\mathsf{Z}})$ is a geodesic space if for every two points $z, z' \\in \\mathsf{Z}$ there exists a path $\\gamma : [0, 1] \\to \\mathsf{Z}$, such that $\\gamma(0) = z$, $\\gamma(1) = z'$, and $d_{\\mathsf{Z}}(\\gamma(s), \\gamma(t)) = (t - s) \\cdot d_{\\mathsf{Z}}(\\gamma(0), \\gamma(1))$ for all $0 \\leq s \\leq t \\leq 1$ (such a path is called a constant-speed geodesic). A functional $F : \\mathsf{Z} \\to \\mathbb{R}$ is geodesically convex if for any pair of points $z, z' \\in \\mathsf{Z}$ there is a constant-speed geodesic $\\gamma$, so that\n\n$$F(\\gamma(t)) \\leq (1 - t) F(\\gamma(0)) + t F(\\gamma(1)) = (1 - t) F(z) + t F(z'), \\quad \\forall t \\in [0, 1].$$\n\nAn upper gradient of a Borel function $f : \\mathsf{Z} \\to \\mathbb{R}$ is a functional $G_f : \\mathsf{Z} \\to \\mathbb{R}_+$, such that for any pair of points $z, z' \\in \\mathsf{Z}$ there exists a constant-speed geodesic $\\gamma$ obeying\n\n$$|f(z') - f(z)| \\leq \\int_0^1 G_f(\\gamma(t))\\, dt \\cdot d_{\\mathsf{Z}}(z, z').$$\n\nWith these definitions at hand, we have the following:\n\nProposition 2. Suppose that $f$ has a geodesically convex upper gradient $G_f$. Then\n\n$$R(Q, f) \\leq R_{\\varrho,p}(P, f) \\leq R(Q, f) + 2\\varrho \\sup_{Q \\in B^W_{\\varrho,p}(P)} \\|G_f(Z)\\|_{L^q(Q)}, \\tag{5}$$\n\nwhere $1/p + 1/q = 1$, and $\\|\\cdot\\|_{L^q(Q)} := (\\mathbb{E}_Q |\\cdot|^q)^{1/q}$.\n\nConsider the setting of regression with quadratic loss: let $\\mathsf{X}$ be a convex subset of $\\mathbb{R}^d$, let $\\mathsf{Y} = [-B, B]$ for some $0 < B < \\infty$, and equip $\\mathsf{Z} = \\mathsf{X} \\times \\mathsf{Y}$ with the Euclidean metric\n\n$$d_{\\mathsf{Z}}(z, z') = \\sqrt{\\|x - x'\\|_2^2 + |y - y'|^2}, \\quad z = (x, y),\\ z' = (x', y'). \\tag{6}$$\n\nSuppose that the functions $f \\in \\mathcal{F}$ are of the form $f(z) = f(x, y) = (y - h(x))^2$ with $h \\in C^1(\\mathbb{R}^d, \\mathbb{R})$, such that $\\|h\\|_{\\mathsf{X}} \\leq M < \\infty$ and $\\|\\nabla h(x)\\|_2 \\leq L\\|x\\|_2$ for some $0 < L < \\infty$. Then Proposition 2 leads to the following:\n\nProposition 3.\n\n$$R(Q, f) \\leq R_{\\varrho,2}(P, f) \\leq R(Q, f) + 4\\varrho (B + M) \\Big( 1 + L \\sup_{Q \\in B^W_{\\varrho,2}(P)} \\sigma_{Q,X} \\Big),$$\n\nwhere $\\sigma_{Q,X} := \\mathbb{E}_Q \\|X\\|_2$ for $Z = (X, Y) \\sim Q$.\n\n2.2 An illustrative example: $\\varrho$ as an exploratory budget\n\nBefore providing formal theoretical guarantees for ERM based on the local minimax risk $R_{\\varrho,p}(P_n, f)$ in Section 3, we give a stylized yet insightful example to illustrate the key difference between the ordinary ERM and the local minimax ERM.
In a nutshell, the local minimax ERM utilizes the\nWasserstein radius % as an exploratory budget to reject hypotheses overly sensitive to domain drift.\nConsider Z \u21e0 Unif[0, 1] =: P on data space Z = [0, 2], along with the hypothesis class F with only\ntwo hypotheses:\n\nf0(z) = 1,\n\nf1(z) =\u21e20,\n\n\u21b5,\n\nz 2 [0, 1)\nz 2 [1, 2]\n\nfor some \u21b5 1. Notice that, if the training data are drawn from Z, the ordinary ERM will always\nreturn f1, the hypothesis that is not robust against small domain drifts, while we are looking for a\nstructured procedure that will return f0, a hypothesis that works well for probability distributions\n\u2018close\u2019 to the data-generating distribution Unif[0, 1].\nThe success of minimax learning depends solely on the ability to transport some weight from a nearby\ntraining sample to 1, the region where nonrobust f1 starts to perform poorly. Speci\ufb01cally, the minimax\nlearning is \u2018successful\u2019 when R%,p(Pn, f0) = 1 is smaller than R%,p(Pn, f1) \u21e1 \u21b5%p/(1 max Zi)p,\nwhich happens with probability 1 (1 %\u21b51/p)n.\nWe make following key observations:\n\u2022 While smaller % leads to the smaller nontrivial excess risk R%,p(P, f1) R%,p(P, f0), it also leads\nto a slower decay of error probability. As a result, for a given %, we can come up with a hypothesis\nclass maximizing the excess risk at target % with excess risk behaving roughly as %p2/(p+1)\nwithout affecting the Rademacher average of the class (see supplementary Appendix B for details).\n\u2022 It is possible to guarantee smooth behavior of the ERM hypothesis without having uniform\nsmoothness assumptions on F; if there exists a single smooth hypothesis f0, it can be used as a\nbaseline comparison to reject nonsmooth hypotheses. We build on this idea in Section 3.3.\n\n3 Guarantees for empirical risk minimization\n\nLet Z1, . . . , Zn be an n-tuple of i.i.d. training examples drawn from P . 
In this section, we analyze the performance of the local minimax ERM procedure\n\n$$\\widehat{f} := \\arg\\min_{f \\in \\mathcal{F}} R_{\\varrho,p}(P_n, f). \\tag{7}$$\n\nThe following strong duality result due to Gao and Kleywegt [11] will be instrumental:\n\nProposition 4. For any upper semicontinuous function $f : \\mathsf{Z} \\to \\mathbb{R}$ and for any $Q \\in \\mathcal{P}_p(\\mathsf{Z})$,\n\n$$R_{\\varrho,p}(Q, f) = \\min_{\\lambda \\geq 0} \\left\\{ \\lambda \\varrho^p + \\mathbb{E}_Q[\\varphi_{\\lambda,f}(Z)] \\right\\}, \\tag{8}$$\n\nwhere $\\varphi_{\\lambda,f}(z) := \\sup_{z' \\in \\mathsf{Z}} \\left( f(z') - \\lambda \\cdot d_{\\mathsf{Z}}^p(z, z') \\right)$.\n\n3.1 Data-dependent bound on generalization error\n\nWe begin by imposing standard regularity assumptions (see, e.g., [7]) which allow us to invoke concentration-of-measure results for empirical processes.\n\nAssumption 1. The instance space $\\mathsf{Z}$ is bounded: $\\mathrm{diam}(\\mathsf{Z}) := \\sup_{z, z' \\in \\mathsf{Z}} d_{\\mathsf{Z}}(z, z') < \\infty$.\n\nAssumption 2. The functions in $\\mathcal{F}$ are upper semicontinuous and uniformly bounded: $0 \\leq f(z) \\leq M < \\infty$ for all $f \\in \\mathcal{F}$ and $z \\in \\mathsf{Z}$.\n\nAs a complexity measure of the hypothesis class $\\mathcal{F}$, we use the entropy integral [19]\n\n$$\\mathfrak{C}(\\mathcal{F}) := \\int_0^\\infty \\sqrt{\\log \\mathcal{N}(\\mathcal{F}, \\|\\cdot\\|_\\infty, u)}\\, du,$$\n\nwhere $\\mathcal{N}(\\mathcal{F}, \\|\\cdot\\|_\\infty, \\cdot)$ denotes the covering number of $\\mathcal{F}$ in the uniform metric $\\|f - f'\\|_\\infty = \\sup_{z \\in \\mathsf{Z}} |f(z) - f'(z)|$.\n\nThe benefits of using the entropy integral $\\mathfrak{C}(\\mathcal{F})$ instead of usual complexity measures such as Rademacher or Gaussian complexity [3] are twofold: (1) $\\mathfrak{C}(\\mathcal{F})$ takes into account the behavior of hypotheses outside the support of the data-generating distribution $P$, and thus can be applied for the assessment of local worst-case risk; (2) Rademacher complexity of $\\varphi_{\\lambda,f}$ can be upper-bounded naturally via $\\mathfrak{C}(\\mathcal{F})$ and the covering number of a suitable bounded subset of $[0, \\infty)$.\n\nWe are now ready to give our data-dependent bound on $R_{\\varrho,p}(P, f)$:\n\nTheorem 1.
For any $\\mathcal{F}, P$ satisfying Assumptions 1\u20132 and for any $t > 0$,\n\n$$\\mathbf{P}\\left( \\exists f \\in \\mathcal{F} : R_{\\varrho,p}(P, f) > \\min_{\\lambda \\geq 0} \\left\\{ (\\lambda + 1) \\varrho^p + \\mathbb{E}_{P_n}[\\varphi_{\\lambda,f}(Z)] + \\frac{M \\sqrt{\\log(\\lambda + 1)}}{\\sqrt{n}} \\right\\} + \\frac{24 \\mathfrak{C}(\\mathcal{F})}{\\sqrt{n}} + \\frac{M t}{\\sqrt{n}} \\right) \\leq 2 \\exp(-2t^2)$$\n\nand\n\n$$\\mathbf{P}\\left( \\exists f \\in \\mathcal{F} : R_{\\varrho,p}(P_n, f) > \\min_{\\lambda \\geq 0} \\left\\{ (\\lambda + 1) \\varrho^p + \\mathbb{E}_{P}[\\varphi_{\\lambda,f}(Z)] + \\frac{M \\sqrt{\\log(\\lambda + 1)}}{\\sqrt{n}} \\right\\} + \\frac{24 \\mathfrak{C}(\\mathcal{F})}{\\sqrt{n}} + \\frac{M t}{\\sqrt{n}} \\right) \\leq 2 \\exp(-2t^2).$$\n\nNotice that Theorem 1 is in the style of data-dependent generalization bounds for margin cost function class [14], often used for the analysis of voting methods or support vector machines [2].\n\nRemark 2. When $\\varrho = 0$, we recover the behavior of the usual statistical risk $R(P, f)$. Specifically, it is not hard to show from the definition of $\\varphi_{\\lambda,f}$ that $\\mathbb{E}_{P_n}[\\varphi_{\\lambda,f}] = \\mathbb{E}_{P_n}[f]$ holds for all\n\n$$\\lambda \\geq \\widehat{\\lambda}_n := \\max_{1 \\leq i \\leq n} \\sup_{z' \\in \\mathsf{Z}} \\frac{f(z') - f(Z_i)}{d_{\\mathsf{Z}}^p(z', Z_i)}.$$\n\nIn that case, when $\\varrho = 0$, the generalization error converges to zero at the rate of $1/\\sqrt{n}$ with usual coefficients from Dudley\u2019s entropy integral [19] and McDiarmid\u2019s inequality, plus an added term of order $M \\sqrt{\\log \\lambda^*} / \\sqrt{n}$ for some $\\lambda^* \\geq \\widehat{\\lambda}_n$.\n\n3.2 Excess risk bounds with uniform smoothness\n\nAs evident from Remark 2, if we have a priori knowledge that the hypothesis selected by the minimax ERM procedure (7) is smooth with respect to the underlying metric, then we can restrict the feasible values of $\\lambda$ to provide data-independent guarantees on generalization error, which vanishes as $n \\to \\infty$. Let us start by imposing the following \u2018uniform smoothness\u2019 on $\\mathcal{F}$:\n\nAssumption 3. The functions in $\\mathcal{F}$ are $L$-Lipschitz: $\\sup_{z \\neq z'} \\frac{f(z') - f(z)}{d_{\\mathsf{Z}}(z', z)} \\leq L$ for all $f \\in \\mathcal{F}$.\n\nOne motivation for Assumption 3 is the following bound on the excess risk: whenever the solution of the original ERM $\\widetilde{f} = \\arg\\min_{f \\in \\mathcal{F}} \\sum_{i=1}^n f(Z_i)$ is $L$-Lipschitz, Kantorovich duality gives us\n\n$$R_{\\varrho,p}(P, \\widetilde{f}) - R^*_{\\varrho,p}(P, \\mathcal{F}) \\leq R(P, \\widetilde{f}) - R^*(P, \\mathcal{F}) + L\\varrho, \\tag{9}$$\n\nwhere the right-hand side is the sum of the excess risk of ordinary ERM and the worst-case deviation of risk due to the ambiguity. The bound (9) is particularly useful when both $\\varrho$ and $n$ are small, but it does not vanish as $n \\to \\infty$.\n\nThe following lemma enables the control of the infimum-achieving dual parameter $\\lambda$ with respect to the true and empirical distribution:\n\nLemma 1. Fix some $Q \\in \\mathcal{P}_p(\\mathsf{Z})$, and define $\\widetilde{f} \\in \\mathcal{F}$ and $\\widetilde{\\lambda} \\geq 0$ via\n\n$$\\widetilde{f} := \\arg\\min_{f \\in \\mathcal{F}} R_{\\varrho,p}(Q, f) \\quad \\text{and} \\quad \\widetilde{\\lambda} := \\arg\\min_{\\lambda \\geq 0} \\left\\{ \\lambda \\varrho^p + \\mathbb{E}_Q[\\varphi_{\\lambda,\\widetilde{f}}(Z)] \\right\\}.$$\n\nThen under Assumptions 1\u20133, we have $\\widetilde{\\lambda} \\leq L\\varrho^{-(p-1)}$.\n\nThen, we can use the Dudley entropy integral arguments [19] on the joint search space of $\\lambda$ and $f$ to get the following theorem:\n\nTheorem 2. Under Assumptions 1\u20133, the following holds with probability at least $1 - \\delta$:\n\n$$R_{\\varrho,p}(P, \\widehat{f}) - R^*_{\\varrho,p}(P, \\mathcal{F}) \\leq \\frac{48 \\mathfrak{C}(\\mathcal{F})}{\\sqrt{n}} + \\frac{48 L \\cdot \\mathrm{diam}(\\mathsf{Z})^p}{\\sqrt{n} \\cdot \\varrho^{p-1}} + 3M \\sqrt{\\frac{\\log(2/\\delta)}{2n}}. \\tag{10}$$\n\nRemark 3. The adversarial training procedure appearing in a concurrent work of Sinha et al. [18] can be interpreted as a relaxed version of local minimax ERM, where we consider $\\lambda$ to be fixed (to enhance implementability), rather than explicitly searching for an optimal $\\lambda$. In such case, Lemma 1 may provide a guideline for the selection of parameter $\\lambda$; for example, one might run the fixed-$\\lambda$ algorithm over a sufficiently fine grid of $\\lambda$ on the interval $[0, L\\varrho^{-(p-1)}]$ to approximate the local minimax ERM.\n\nNote that when $p = 1$, we get a $\\varrho$-free bound of order $1/\\sqrt{n}$, recovering the correct rate of ordinary ERM at $\\varrho = 0$. On the other hand, Theorem 2 cannot be used to recover the rate of ordinary ERM for $p > 1$. This phenomenon is due to the fact that we are using the Lipschitz assumption on $\\mathcal{F}$, which is a data-independent constraint on the scale of the trade-off between $f(z') - f(z)$ and $d_{\\mathsf{Z}}(z', z)$.
For $p > 1$, one may also think of a similar data-independent (or, worst-case) constraint\n\n$$\\sup_{z, z'} \\frac{f(z') - f(z)}{d_{\\mathsf{Z}}^p(z', z)} < +\\infty.$$\n\nHowever, this holds only if $f$ is constant, even in the simplest case $\\mathsf{Z} \\subseteq \\mathbb{R}$.\n\n3.3 Excess risk bound with minimal assumptions\n\nThe illustrative example presented in Section 2.2 implies that the minimax learning might be possible even when the functions in $\\mathcal{F}$ are not uniformly Lipschitz, but there exists at least one smooth hypothesis (at least, except for the regime $\\varrho \\to 0$). Based on that observation, we now consider a weaker alternative to Assumption 3:\n\nAssumption 4. There exists a hypothesis $f_0 \\in \\mathcal{F}$, such that, for all $z \\in \\mathsf{Z}$, $f_0(z) \\leq C_0 d_{\\mathsf{Z}}^p(z, z_0)$ for some $C_0 \\geq 0$ and $z_0 \\in \\mathsf{Z}$.\n\nAssumption 4 guarantees the existence of a hypothesis with smooth behavior with respect to the underlying metric $d_{\\mathsf{Z}}$; on the other hand, smoothness is not required for every $f \\in \\mathcal{F}$, and thus Assumption 4 is particularly useful when paired with a rich class $\\mathcal{F}$.\n\nIt is not difficult to see that Assumption 4 holds for most common hypothesis classes. As an example, consider again the setting of regression with quadratic loss as in Proposition 3; the functions $f \\in \\mathcal{F}$ are of the form $f(z) = f(x, y) = (y - h(x))^2$, where $h$ runs over some given class of candidate predictors that contains constants. Then, we can take $h_0(x) \\equiv 0$, in which case $f_0(z) = (h_0(x) - y)^2 = |y|^2 \\leq d_{\\mathsf{Z}}^2(z, z_0)$ for all $z_0$ of the form $(x, 0) \\in \\mathsf{X} \\times \\mathsf{Y}$.\n\nUnder Assumption 4, we can prove the following counterpart of Lemma 1:\n\nLemma 2. Fix some $Q \\in \\mathcal{P}_p(\\mathsf{Z})$. Define $\\widetilde{f} \\in \\mathcal{F}$ and $\\widetilde{\\lambda} \\geq 0$ via\n\n$$\\widetilde{f} := \\arg\\min_{f \\in \\mathcal{F}} R_{\\varrho,p}(Q, f) \\quad \\text{and} \\quad \\widetilde{\\lambda} := \\arg\\min_{\\lambda \\geq 0} \\left\\{ \\lambda \\varrho^p + \\mathbb{E}_Q[\\varphi_{\\lambda,\\widetilde{f}}(Z)] \\right\\}.$$\n\nThen, under Assumptions 1, 2, 4, $\\widetilde{\\lambda} \\leq C_0 2^{p-1} \\left( 1 + (\\mathrm{diam}(\\mathsf{Z})/\\varrho)^p \\right)$.\n\nAn intuition behind Lemma 2 is to interpret the Wasserstein perturbation $\\varrho$ as a regularization parameter to thin out hypotheses with non-smooth behavior around $Q$ by comparing it to $f_0$.
As $\\varrho$ grows, a smaller dual parameter $\\widetilde{\\lambda}$ is sufficient to control the adversarial behavior.\n\nWe can now give a performance guarantee for the ERM procedure (7):\n\nTheorem 3. Under Assumptions 1, 2, 4, the following holds with probability at least $1 - \\delta$:\n\n$$R_{\\varrho,p}(P, \\widehat{f}) - R^*_{\\varrho,p}(P, \\mathcal{F}) \\leq \\frac{48 \\mathfrak{C}(\\mathcal{F})}{\\sqrt{n}} + \\frac{24 C_0 (2\\, \\mathrm{diam}(\\mathsf{Z}))^p}{\\sqrt{n}} \\left( 1 + \\left( \\frac{\\mathrm{diam}(\\mathsf{Z})}{\\varrho} \\right)^p \\right) + 3M \\sqrt{\\frac{\\log(2/\\delta)}{2n}}. \\tag{11}$$\n\nRemark 4. The second term decreases as $\\varrho$ grows, which is consistent with the phenomenon illustrated in Section 2.2. Also note that the excess risk bound of [9] shows the same behavior as Theorem 3, where in that case $\\varrho$ is the slack in the moment constraints defining the ambiguity set. While larger ambiguity can be helpful for learnability in this sense, note that the risk inequalities of Section 2.1 imply that $R_{\\varrho,p}(P, f) - R(P, f)$ can be bigger with larger $\\varrho$. Using these two elements, one can provide domain-specific excess risk bounds which explicitly describe the interplay of both elements with ambiguity (see Section 4).\n\n3.4 Example bounds\n\nIn this subsection, we illustrate the use of Theorem 2 when (upper bounds on) the covering numbers for the hypothesis class $\\mathcal{F}$ are available. Throughout this section, we continue to work in the setting of regression with quadratic loss as in Proposition 3; we let $\\mathsf{X} = \\{x \\in \\mathbb{R}^d : \\|x\\|_2 \\leq r_0\\}$ be a ball of radius $r_0$ in $\\mathbb{R}^d$ centered at the origin, let $\\mathsf{Y} = [-B, B]$ for some $B > 0$, and equip $\\mathsf{Z}$ with the Euclidean metric (6). Also, we take $p = 1$.\n\nWe first consider a simple neural network class $\\mathcal{F}$ consisting of functions of the form $f(z) = f(x, y) = (y - s(f_0^T x))^2$, where $s : \\mathbb{R} \\to \\mathbb{R}$ is a bounded smooth nonlinearity with $s(0) = 0$ and with bounded first derivative, and where $f_0$ takes values in the unit ball in $\\mathbb{R}^d$.\n\nCorollary 1.
For any $P \\in \\mathcal{P}(\\mathsf{Z})$, with probability at least $1 - \\delta$,\n\n$$R_{\\varrho,1}(P, \\widehat{f}) - R^*_{\\varrho,1}(P, \\mathcal{F}) \\leq \\frac{C_1}{\\sqrt{n}} + \\frac{3 (\\|s\\|_\\infty + B)^2 \\sqrt{\\log(2/\\delta)}}{\\sqrt{2n}},$$\n\nwhere $C_1$ is a constant dependent only on $d, r_0, s, B$:\n\n$$C_1 = (B + \\|s\\|_\\infty) \\cdot \\left( 144 r_0 \\sqrt{d}\\, \\|s'\\|_\\infty + 192 (1 + \\|s'\\|_\\infty) \\sqrt{2 (r_0^2 + B^2)} \\right).$$\n\nWe also consider the case of a massive nonparametric class. Let $(\\mathcal{H}_K, \\|\\cdot\\|_K)$ be the Gaussian reproducing kernel Hilbert space (RKHS) with the kernel $K(x_1, x_2) = \\exp\\left( -\\|x_1 - x_2\\|_2^2 / \\sigma^2 \\right)$ for some $\\sigma > 0$, and let $B_r := \\{h \\in \\mathcal{H}_K : \\|h\\|_K \\leq r\\}$ be the radius-$r$ ball in $\\mathcal{H}_K$. Let $\\mathcal{F}$ be the class of all functions of the form $f(z) = f(x, y) = (y - f_0(x))^2$, where the predictors $f_0 : \\mathsf{X} \\to \\mathbb{R}$ belong to $I_K(B_r)$, an embedding of $B_r$ into the space $C(\\mathsf{X})$ of continuous real-valued functions on $\\mathsf{X}$ equipped with the sup norm $\\|f\\|_{\\mathsf{X}} := \\sup_{x \\in \\mathsf{X}} |f(x)|$.\n\nUsing the covering number estimates due to Cucker and Zhou [7], we can prove the following generalization bounds for Gaussian RKHS.\n\nCorollary 2. With probability at least $1 - \\delta$, for any $P \\in \\mathcal{P}(\\mathsf{Z})$,\n\n$$R_{\\varrho,1}(P, \\widehat{f}) - R^*_{\\varrho,1}(P, \\mathcal{F}) \\leq \\frac{C_1}{\\sqrt{n}} (r^2 + Br) + \\frac{192 \\sqrt{2} (r + B) \\cdot (1 + r\\sqrt{2}/\\sigma) \\sqrt{r_0^2 + B^2}}{\\sqrt{n}} + \\frac{6 (r^2 + B^2) \\sqrt{\\log(2/\\delta)}}{\\sqrt{2n}},$$\n\nwhere $C_1$ is a constant dependent only on $d, r_0, \\sigma$:\n\n$$C_1 = 48 \\sqrt{d} \\left( 2 \\Gamma\\left( \\frac{d + 3}{2}, \\log 2 \\right) + (\\log 2)^{\\frac{d+1}{2}} \\right) \\left( 32 + \\frac{2560\\, d\\, r_0^2}{\\sigma^2} \\right)^{\\frac{d+1}{2}}$$\n\n(here, $\\Gamma(s, v) := \\int_v^\\infty u^{s-1} e^{-u}\\, du$ is the incomplete gamma function).\n\n4 Application: Domain adaptation with optimal transport\n\nAmbiguity sets based on Wasserstein distances have two attractive features. First, the metric geometry of the instance space provides a natural mechanism for handling uncertainty due to transformations on the problem instances. For example, concurrent work by Sinha et al. [18] interprets the underlying metric as a perturbation cost of an adversary in the context of adversarial examples [12].
Second, Wasserstein distances can be approximated efficiently from the samples; Fournier and Guillin [10] provide nonasymptotic convergence results in terms of both moments and probability for general $p$. This allows us to approximate the Wasserstein distance between two distributions $W_p(P, Q)$ by the Wasserstein distance between their empirical distributions $W_p(P_n, Q_n)$, which makes it possible to specify a suitable level of ambiguity $\\varrho$.\n\nOne interesting area of application, where we benefit from both of these aspects, is the problem of domain adaptation, arising when we want to transfer the data/knowledge from a source domain $P \\in \\mathcal{P}(\\mathsf{Z})$ to a different but related target domain $Q \\in \\mathcal{P}(\\mathsf{Z})$ [4]. While the domain adaptation problem is often stated in a broader context, we confine our discussion to adaptation in supervised learning, assuming $\\mathsf{Z} = \\mathsf{X} \\times \\mathsf{Y}$ where $\\mathsf{X}$ is the feature space and $\\mathsf{Y}$ is the label space. From now on, we disintegrate the source distribution as $P = \\mu \\otimes P_{Y|X}$ and the target distribution as $Q = \\nu \\otimes Q_{Y|X}$.\n\nExisting theoretical results on domain adaptation are phrased in terms of the \u2018discrepancy metric\u2019 [15]: given a loss function $l : \\mathsf{Y} \\times \\mathsf{Y} \\to \\mathbb{R}$ and a family of predictors $\\mathcal{H}$ of the form $h : \\mathsf{X} \\to \\mathsf{Y}$, the discrepancy metric is defined as\n\n$$\\mathrm{disc}_{\\mathcal{H}}(\\mu, \\nu) := \\max_{h, h' \\in \\mathcal{H}} \\left| \\mathbb{E}_\\mu[l(h(X), h'(X))] - \\mathbb{E}_\\nu[l(h(X), h'(X))] \\right|.$$\n\nTypical theoretical guarantees involving the discrepancy metric take the form of generalization bounds: for any $h \\in \\mathcal{H}$,\n\n$$R(Q, h) - R^*(Q, \\mathcal{H}) \\leq R(P, h) + \\mathrm{disc}_{\\mathcal{H}}(\\mu, \\nu) + \\mathbb{E}_\\nu\\left[ l(h^*_P(X), h^*_Q(X)) \\right] \\tag{12}$$\n\nwhere $h^*_P$ and $h^*_Q$ are minimizers of $R(P, h) = \\mathbb{E}_P[l(h(X), Y)]$ and $R(Q, h) = \\mathbb{E}_Q[l(h(X), Y)]$. While these generalization bounds provide a uniform guarantee for all predictors in a class, they can be considered \u2018pessimistic\u2019 in the sense that we compare the excess risk to $R(P, h)$, which is the performance of some selected predictor at the source domain.\n\nOur work, on the other hand, aims to provide an excess risk bound for a specific target hypothesis $\\widehat{f}$ given by the solution of a minimax ERM. Suppose that it is possible to estimate the Wasserstein distance $W_p(P, Q)$ between the two domain distributions. Then, as we show below, we can provide a generalization bound for the target domain by combining estimation guarantees for $W_p(P, Q)$ with risk inequalities of Section 2. All proofs are given in supplementary Appendix E.\n\nWe work in the setting considered by Courty et al. [6]: Let $\\mathsf{X}, \\mathsf{Y}$ be metric spaces with metrics $d_{\\mathsf{X}}$ and $d_{\\mathsf{Y}}$. We then endow $\\mathsf{Z}$ with the $\\ell_p$ product metric\n\n$$d_{\\mathsf{Z}}(z, z') = d_{\\mathsf{Z}}((x, y), (x', y')) := \\left( d_{\\mathsf{X}}^p(x, x') + d_{\\mathsf{Y}}^p(y, y') \\right)^{1/p}.$$\n\nWe assume that domain drift is due to an unknown (possibly nonlinear) transformation $T : \\mathsf{X} \\to \\mathsf{X}$ of the feature space that preserves the conditional distribution of the labels given the features, e.g. acquisition condition, sensor drift, thermal noise, etc.
That is, $\\nu = T_\\# \\mu$, the pushforward of $\\mu$ by $T$, and for any $x \\in \\mathsf{X}$ and any measurable set $B \\subseteq \\mathsf{Y}$\n\n$$P_{Y|X}(B | x) = Q_{Y|X}(B | T(x)). \\tag{13}$$\n\nThis assumption leads to the following lemma, which enables us to estimate $W_p(P, Q)$ only from unlabeled source domain data and unlabeled target domain data:\n\nLemma 3. Suppose there exists a deterministic and invertible optimal transport map $T : \\mathsf{X} \\to \\mathsf{X}$ such that $\\nu = T_\\# \\mu$, i.e., $W_p^p(\\mu, \\nu) = \\mathbb{E}_\\mu[d_{\\mathsf{X}}^p(X, T(X))]$. Then\n\n$$W_p(P, Q) = W_p(\\mu, \\nu). \\tag{14}$$\n\nRemark 5. If $\\mathsf{X}$ is a convex subset of $\\mathbb{R}^d$ endowed with the $\\ell_p$ metric $d_{\\mathsf{X}}(x, x') = \\|x - x'\\|_p$ for $p \\geq 2$, then, under the assumption that $\\mu$ and $\\nu$ have positive densities with respect to the Lebesgue measure, the (unique) optimal transport map from $\\mu$ to $\\nu$ is deterministic and a.e. invertible \u2013 in fact, its inverse is equal to the optimal transport map from $\\nu$ to $\\mu$ [21].\n\nNow suppose that we have $n$ labeled examples $(X_1, Y_1), \\ldots, (X_n, Y_n)$ from $P$ and $m$ unlabeled examples $X'_1, \\ldots, X'_m$ from $\\nu$. Define the empirical distributions\n\n$$\\mu_n = \\frac{1}{n} \\sum_{i=1}^n \\delta_{X_i}, \\qquad \\nu_m = \\frac{1}{m} \\sum_{j=1}^m \\delta_{X'_j}.$$\n\nNotice that, by the triangle inequality, we have\n\n$$W_p(\\mu, \\nu) \\leq W_p(\\mu, \\mu_n) + W_p(\\mu_n, \\nu_m) + W_p(\\nu, \\nu_m). \\tag{15}$$\n\nHere, $W_p(\\mu_n, \\nu_m)$ can be computed from unlabeled data by solving a finite-dimensional linear program [21], and the following convergence result of Fournier and Guillin [10] implies that, with high probability, both $W_p(\\mu, \\mu_n)$ and $W_p(\\nu, \\nu_m)$ rapidly converge to zero as $n, m \\to \\infty$:\n\nProposition 5. Let $\\mu$ be a probability distribution on a bounded set $\\mathsf{X} \\subset \\mathbb{R}^d$, where $d > 2p$. Let $\\mu_n$ denote the empirical distribution of $X_1, \\ldots, X_n \\overset{\\text{i.i.d.}}{\\sim} \\mu$.
Then, for any $r \\in (0, \\infty)$,\n\n$$\\mathbf{P}(W_p(\\mu_n, \\mu) \\geq r) \\leq C_a \\exp(-C_b n r^{d/p}) \\tag{16}$$\n\nwhere $C_a, C_b$ are constants depending on $p, d, \\mathrm{diam}(\\mathsf{X})$ only.\n\nRemark 6. Note that $d > 2p$ is not a necessary constraint, and the bound still holds in the case $d \\leq 2p$ with different speed of convergence. In particular, Proposition 5 is a constrained version of [10, Thm. 2] under finite $\\mathcal{E}_{\\alpha,\\gamma}(\\mu)$ for $\\alpha = d > p$.\n\nBased on these considerations, we propose the following domain adaptation scheme:\n\n1. Compute the $p$-Wasserstein distance $W_p(\\mu_n, \\nu_m)$ between the empirical distributions of the features in the labeled training set from the source domain $P$ and the unlabeled training set from the target domain $Q$.\n\n2. Set the desired confidence parameter $\\delta \\in (0, 1)$ and the radius\n\n$$\\widehat{\\varrho}(\\delta) := W_p(\\mu_n, \\nu_m) + \\left( \\frac{\\log(4C_a/\\delta)}{C_b n} \\right)^{p/d} + \\left( \\frac{\\log(4C_a/\\delta)}{C_b m} \\right)^{p/d}. \\tag{17}$$\n\n3. Compute the empirical risk minimizer\n\n$$\\widehat{f} = \\arg\\min_{f \\in \\mathcal{F}} R_{\\widehat{\\varrho}(\\delta),p}(P_n, f), \\tag{18}$$\n\nwhere $P_n$ is the empirical distribution of the $n$ labeled samples from $P$.\n\nWe can give the following target domain generalization bound for the hypothesis generated by (18):\n\nTheorem 4. Suppose that the feature space $\\mathsf{X}$ is a bounded subset of $\\mathbb{R}^d$ with $d > 2p$, take $d_{\\mathsf{X}}(x, x') = \\|x - x'\\|_p$, and let $\\mathcal{F}$ be a family of hypotheses with Lipschitz constant at most $L$. Then, the empirical risk minimizer $\\widehat{f}$ from (18) satisfies\n\n$$R(Q, \\widehat{f}) - R^*(Q, \\mathcal{F}) \\leq 2L\\widehat{\\varrho}(\\delta) + \\frac{48 \\mathfrak{C}(\\mathcal{F})}{\\sqrt{n}} + \\frac{48 L \\cdot \\mathrm{diam}^p(\\mathsf{Z})}{\\sqrt{n}\\, \\widehat{\\varrho}^{\\,p-1}} + \\frac{3M \\sqrt{\\log(4/\\delta)}}{\\sqrt{2n}}$$\n\nwith probability at least $1 - \\delta$.\n\nRemark 7. Comparing the bound of Theorem 4 with the discrepancy-based bound (12), we note that the former does not contain any terms related to $R(P, \\widehat{f})$ or the closeness of the optimal predictors for $P$ and $Q$. The only contributions to the excess risk are the empirical Wasserstein distance $W_p(\\mu_n, \\nu_m)$ (which captures the discrepancy between the source and the target domains in a data-driven manner) and an empirical process fluctuation term. In this sense, the bound of Theorem 4 is closer in spirit to the usual excess risk bounds one obtains in the absence of domain drift.\n\nAcknowledgement\n\nWe would like to thank Pierre Moulin, Yung Yi, and anonymous reviewers for helpful discussions.\n\nReferences\n\n[1] L. Ambrosio, N. Gigli, and G. Savar\u00e9. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkh\u00e4user, 2008.\n\n[2] Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. Advances in Kernel methods\u2014support vector learning, pages 43\u201354, 1999.\n\n[3] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463\u2013482, 2002.\n\n[4] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79:151\u2013175, 2010.\n\n[5] J. Blanchet, Y. Kang, and K. Murthy. Robust Wasserstein profile inference and applications to machine learning. arXiv preprint 1610.05627v2, 2016.\n\n[6] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853\u20131865, September 2017.\n\n[7] F. Cucker and D. Zhou. Learning theory: an approximation theory viewpoint. Cambridge University Press, Cambridge, MA, 2007.\n\n[8] J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: a generalized empirical likelihood approach. arXiv preprint 1610.03425, 2016.\n\n[9] F. Farnia and D. Tse. A minimax approach to supervised learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R.
Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4240\u20134248, 2016.\n\n[10] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3\u20134):707\u2013738, 2015.\n\n[11] R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint 1604.02199, 2016.\n\n[12] I. J. Goodfellow, J. Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint 1412.6572v3, 2014.\n\n[13] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems, volume 2033 of Lecture Notes in Mathematics. Springer, 2011.\n\n[14] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1\u201350, 2002.\n\n[15] Y. Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Sanjoy Dasgupta and Adam Klivans, editors, Proceedings of 22nd Annual Conference on Learning Theory, 2009.\n\n[16] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming, 171(1):115\u2013166, Sep 2018.\n\n[17] S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani, and D. Kuhn. Distributionally robust logistic regression. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1576\u20131584, 2015.\n\n[18] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.\n\n[19] M. Talagrand. 
Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer, 2014.\n\n[20] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.\n\n[21] C. Villani. Topics in optimal transportation. American Mathematical Society, Providence, RI, 2003.\n", "award": [], "sourceid": 1380, "authors": [{"given_name": "Jaeho", "family_name": "Lee", "institution": "University of Illinois at Urbana-Champaign"}, {"given_name": "Maxim", "family_name": "Raginsky", "institution": "University of Illinois at Urbana-Champaign"}]}