{"title": "Robust Hypothesis Testing Using Wasserstein Uncertainty Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 7902, "page_last": 7912, "abstract": "We develop a novel computationally efficient and general framework for robust hypothesis testing. The new framework features a new way to construct uncertainty sets under the null and the alternative distributions, which are sets centered around the empirical distribution defined via Wasserstein metric, thus our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that such approximation renders a nearly-optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of such approximation, with complexity independent of the dimension of observation space and can be nearly sample-size-independent in general. Real-data example using human activity data demonstrated the excellent performance of the new robust detector.", "full_text": "Robust Hypothesis Testing Using\n\nWasserstein Uncertainty Sets\n\nSchool of Industrial and Systems Engineering\n\nSchool of Industrial and Systems Engineering\n\nRui Gao\n\nLiyan Xie\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nrgao32@gatech.edu\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nlxie49@gatech.edu\n\nSchool of Industrial and Systems Engineering\n\nSchool of Industrial and Systems Engineering\n\nYao Xie\n\nHuan Xu\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nyao.xie@isye.gatech.edu\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nhuan.xu@isye.gatech.edu\n\nAbstract\n\nWe develop a novel computationally ef\ufb01cient and general framework for robust\nhypothesis testing. 
The new framework features a new way to construct uncertainty sets under the null and the alternative distributions: the sets are centered around the empirical distribution and defined via the Wasserstein metric, so our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that this approximation yields a nearly-optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of the approximation, whose complexity is independent of the dimension of the observation space and can be nearly independent of the sample size in general. A real-data example using human activity data demonstrates the excellent performance of the new robust detector.

1 Introduction

Hypothesis testing is a fundamental problem in statistics and an essential building block for scientific discovery and for many machine learning problems such as anomaly detection. The goal is to develop a decision rule or detector that can discriminate between two (or more) hypotheses based on data and achieve a small error probability. For a simple hypothesis test, it is well known from the Neyman-Pearson lemma that the likelihood ratio between the distributions of the two hypotheses is optimal. In practice, however, when the true distribution deviates from the assumed nominal distribution, the likelihood ratio detector is no longer optimal and may perform poorly.

Various robust hypothesis testing frameworks have been developed to address distribution misspecification and outliers. Robust detectors are constructed by introducing uncertainty sets for the distributions under the null and the alternative hypotheses.
In the non-parametric setting, Huber's original work [13] considers the so-called ε-contamination sets, which contain distributions that are close to the nominal distributions in terms of the total variation metric. More recent works [17, 9] consider uncertainty sets induced by the Kullback-Leibler divergence around a nominal distribution. Based on this, robust detectors usually depend on the so-called least favorable distributions (LFD). Although there has been much theoretical success, computing robust detectors and finding the LFD remain major challenges in general. Existing results are usually restricted to the one-dimensional setting; in the multi-dimensional setting, finding the LFD remains an open question in the literature. In the parametric setting, [1] provides a computationally efficient and provably near-optimal framework for robust hypothesis testing based on convex optimization.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we present a novel computationally efficient framework for developing data-driven robust minimax detectors for non-parametric hypothesis testing based on the Wasserstein distance, in which the robust uncertainty set contains all distributions that are close to the empirical distributions in Wasserstein distance. This is practical since we do not assume any parametric form for the distribution, but rather "let the data speak for itself". Moreover, the Wasserstein distance is a more flexible measure of closeness between two distributions.
The distance measures used in other non-parametric frameworks [13, 17, 9] are not well-defined for distributions with non-overlapping support, which occurs often in (1) data-driven problems, in which we often want to measure the closeness between an empirical distribution and some continuous underlying true distribution, and (2) high-dimensional problems, in which we may want to compare two high-dimensional distributions supported on two low-dimensional manifolds with measure-zero intersection.

To solve the minimax robust detector problem, we face at least three difficulties: (i) the hypothesis testing error probability is a nonconvex function of the decision variable; (ii) the optimization over all possible detectors is hard in general, since we consider infinite-dimensional detectors with nonlinear dependence on the data; (iii) finding the worst-case distribution over the uncertainty sets is also an infinite-dimensional optimization problem in general. To tackle these difficulties, in Section 3 we develop a safe approximation of the minimax formulation by considering a family of tests with a special form that facilitates a convex approximation. We show that this approximation yields a nearly-optimal detector among the family of all possible tests (Theorem 1), and that the risk of the optimal detector is closely related to divergence measures (Theorem 2). In Section 4, exploiting the structure of the least favorable distributions arising from Wasserstein uncertainty sets, we derive a tractable and scalable convex programming reformulation of the safe approximation based on strong duality (Theorem 3). Finally, Section 5 demonstrates the excellent performance of our robust detectors using real data for human activity detection.

2 Problem Set-up and Related Work

Let Ω ⊂ R^d be the observation space in which the observed random variable takes its values.
Denote by P(Ω) the set of all probability distributions on Ω. Let P1, P2 ⊂ P(Ω) be the uncertainty sets associated with hypotheses H1 and H2; they are two families of probability distributions on Ω. We assume that the true probability distribution of the observed random variable belongs to either P1 or P2. Given an observation ω of the random variable, we would like to decide which one of the following hypotheses is true:

H1 : ω ~ P1, P1 ∈ P1,
H2 : ω ~ P2, P2 ∈ P2.

A test for this problem is a (Lebesgue) measurable function T : Ω → {1, 2}. Given an observation ω ∈ Ω, the test accepts hypothesis H_{T(ω)} and rejects the other. A test is called simple if P1 and P2 are singletons.

The worst-case risk of a test is defined as the maximum of the worst-case type-I and type-II errors:

ε(T | P1, P2) := max( sup_{P1 ∈ P1} P1{ω : T(ω) = 2}, sup_{P2 ∈ P2} P2{ω : T(ω) = 1} ).

Here, without loss of generality, we define the risk as the maximum of the two types of errors. Our framework extends directly to the case where the risk is a linear combination of the type-I and type-II errors (as usually considered in statistics).

We consider the minimax robust hypothesis test formulation, in which the goal is to find a test that minimizes the worst-case risk. More specifically, given P1, P2 and ε > 0, we would like to find an ε-optimal solution of the problem

inf_{T : Ω → {1,2}} ε(T | P1, P2).   (1)

We construct our uncertainty sets P1, P2 to be centered around two empirical distributions and defined using the Wasserstein metric.
Given two empirical distributions Q_k = (1/n_k) sum_{i=1}^{n_k} δ_{ω̂_i^k}, k = 1, 2, based on samples drawn from the two underlying distributions respectively, where δ_ω denotes the Dirac measure at ω, define the sets using the Wasserstein metric (of order 1):

P_k = {P ∈ P(Ω) : W(P, Q_k) ≤ θ_k}, k = 1, 2,   (2)

where θ_k > 0 specifies the radius of the set, and W(P, Q) denotes the Wasserstein metric of order 1:

W(P, Q) := min_{γ ∈ P(Ω²)} { E_{(ω,ω′)~γ}[‖ω − ω′‖] : γ has marginal distributions P and Q },

where ‖·‖ is an arbitrary norm on R^d. We consider the Wasserstein metric of order 1 for ease of exposition. Intuitively, the joint distribution γ on the right-hand side can be viewed as a transportation plan that transports probability mass from P to Q. Thus, the Wasserstein metric between two distributions equals the cheapest cost (measured in the norm ‖·‖) of transporting probability mass from one distribution to the other. In particular, if both P and Q are finitely supported, the minimization reduces to a transportation problem in linear programming. The Wasserstein metric has recently become popular in machine learning as a way to measure the distance between probability distributions, and has been applied in areas including computer vision [25, 16, 23], generative adversarial networks [2, 10], and distributionally robust optimization [6, 7, 4, 27, 26].

2.1 Related Work

We present a brief review of robust hypothesis testing and related work. The most common form of hypothesis test in statistics is the simple hypothesis test.
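For finitely supported distributions, the transportation-problem view of W(P, Q) described above can be sketched directly as a linear program. The following is a minimal illustration using scipy's LP solver; the support points and weights are made-up example data, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(x, p, y, q):
    """Order-1 Wasserstein distance between sum_i p_i*delta_{x_i} and
    sum_j q_j*delta_{y_j}, solved as a transportation LP over couplings gamma."""
    n, m = len(p), len(q)
    # cost c[i, j] = ||x_i - y_j|| (Euclidean norm here), flattened row-major
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2).ravel()
    # marginal constraints: row sums of gamma equal p, column sums equal q
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# two empirical distributions on the real line
x = np.array([[0.0], [1.0]]); p = np.array([0.5, 0.5])
y = np.array([[2.0]]);        q = np.array([1.0])
dist = wasserstein1(x, p, y, q)  # mass 0.5 moves distance 2 and 0.5 moves distance 1
```

Here the coupling is forced (all mass must reach the single atom at 2), so the cost is 0.5 * 2 + 0.5 * 1 = 1.5.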
A simple hypothesis test assumes that the null and the alternative distributions are singletons. Suppose one is interested in discriminating between H0 : θ = θ0 and H1 : θ = θ1 when the data x is assumed to follow a distribution f_θ with parameter θ. The likelihood ratio test rejects H0 when f_{θ1}(x)/f_{θ0}(x) exceeds a threshold. The celebrated Neyman-Pearson lemma says that the likelihood ratio test is the most powerful test at a given significance level; in other words, it achieves the minimum type-II error for any given type-I error. In practice, when the true distributions deviate from the two assumed distributions, especially in the presence of outliers, the likelihood ratio test is no longer optimal. So-called robust detectors aim to extend the simple hypothesis test to composite tests, in which the null and the alternative hypotheses each comprise a family of distributions. There are two main approaches to minimax robust hypothesis testing: one dates back to Huber's seminal work [13], and one is attributed to [17]. Huber considers composite hypotheses over the so-called ε-contamination sets, which are total variation classes of distributions around nominal distributions, while the more recent works [17, 9] consider uncertainty sets defined using the Kullback-Leibler (KL) divergence and derive various closed-form LFDs in the one-dimensional setting. In the multi-dimensional setting, however, establishing robust sequential detection procedures or finding the LFD remains computationally challenging; indeed, closed-form LFDs are known only for the one-dimensional case (e.g., [12, 18, 9]).
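The baseline likelihood ratio test described above can be illustrated with a small simulation. This is a sketch under assumed simple hypotheses (two unit-variance Gaussians); all numbers are illustrative and not from the paper.

```python
import numpy as np

# Likelihood ratio test between two simple hypotheses:
#   H0: x ~ N(0, 1)  vs  H1: x ~ N(1, 1).
# Reject H0 when f1(x)/f0(x) exceeds a threshold; for Gaussians the
# log-likelihood ratio is linear in x, so the test reduces to x > c.
def log_lr(x, mu0=0.0, mu1=1.0):
    return (mu1 - mu0) * x - 0.5 * (mu1 ** 2 - mu0 ** 2)

rng = np.random.default_rng(0)
tau = 0.0                               # threshold on the log-likelihood ratio
x0 = rng.normal(0.0, 1.0, 100_000)      # samples under H0
x1 = rng.normal(1.0, 1.0, 100_000)      # samples under H1
type_I = np.mean(log_lr(x0) > tau)      # false-alarm rate
type_II = np.mean(log_lr(x1) <= tau)    # miss rate
```

With tau = 0, the test is x > 0.5, so both error rates concentrate near Phi(-0.5), roughly 0.31, matching the symmetry of the two hypotheses.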
Moreover, classic hypothesis testing is usually parametric, in that the distribution functions under the null and the alternative are assumed to belong to a family of distributions indexed by certain parameters.

Recent works [8, 14] take a different approach from the classic statistical one. Although "robust hypothesis test" is not mentioned there, the formulation is essentially a minimax robust hypothesis test in which the null and the alternative distributions are parametric, with parameters belonging to certain convex sets. They show that when the exponential function is used as a convex relaxation, the optimal detector corresponds to the likelihood ratio test between two LFDs obtained from a convex program. Our work is inspired by [8, 14] and extends the state of the art in several ways. First, we consider more general classes of convex relaxations, and show that they produce a nearly-optimal detector for the original problem and admit an exact tractable reformulation for common convex surrogate loss functions. In contrast, the tractability of the framework in [8] relies heavily on the particular choice of the convex loss, because their parametric framework imposes a stringent convexity requirement in the distribution parameters that fails to hold for general convex losses even in the Gaussian case; our non-parametric framework only requires convexity in the distribution, which holds for general convex surrogates and imposes no conditions on the considered distributions. In addition, certain convex loss functions yield a tighter nearly-optimal detector than the one considered in [8]. Furthermore, the tractability of our framework rests on novel strong duality results, Proposition 1 and Theorem 3.
They are nontrivial and, to the best of our knowledge, cannot be obtained by extending strong duality results on robust hypothesis testing [8] or distributionally robust optimization (DRO) [4, 6, 7], as will be elaborated later. We finally remark that [24] also considered using the Wasserstein metric for hypothesis testing and drew connections between different test statistics. Our focus differs from theirs, as we consider the Wasserstein metric mainly for the minimax robust formulation.

3 Optimal Detector

We consider a family of tests with a special form, referred to as detectors. A detector φ : Ω → R is a measurable function associated with a test T_φ which, for a given observation ω ∈ Ω, accepts H1 and rejects H2 whenever φ(ω) ≥ 0, and otherwise accepts H2 and rejects H1. The restriction of problem (1) to the set of all detectors is

inf_{φ : Ω → R} max( sup_{P1 ∈ P1} P1{ω : φ(ω) < 0}, sup_{P2 ∈ P2} P2{ω : φ(ω) ≥ 0} ).   (3)

We next develop a safe approximation of problem (3) that provides an upper bound via convex approximations of the indicator function [22]. We introduce a notion called the generating function.

Definition 1 (Generating function). A generating function ℓ : R → R₊ ∪ {∞} is a nonnegative-valued, nondecreasing, convex function satisfying ℓ(0) = 1 and lim_{t → −∞} ℓ(t) = 0.

For any probability distribution P, it holds that P{ω : φ(ω) < 0} ≤ E_P[ℓ(−φ(ω))].
Set

Φ(φ; P1, P2) := E_{P1}[ℓ(−φ(ω))] + E_{P2}[ℓ(φ(ω))].

We define the risk of a detector for a test (P1, P2) by

ε(φ | P1, P2) := sup_{P1 ∈ P1, P2 ∈ P2} Φ(φ; P1, P2).

It follows that the following problem provides an upper bound on problem (3):

inf_{φ : Ω → R} sup_{P1 ∈ P1, P2 ∈ P2} Φ(φ; P1, P2).   (4)

We next bound the gap between (4) and (1). To facilitate the discussion, we introduce an auxiliary function ψ, which is well-defined due to the assumptions on ℓ:

ψ(p) := min_{t ∈ R} [p ℓ(t) + (1 − p) ℓ(−t)],   0 ≤ p ≤ 1.

For various generating functions ℓ, ψ admits a closed-form expression. Table 1 lists some choices of generating function ℓ and their corresponding auxiliary functions ψ. Note that the hinge loss (last row of the table) leads to the smallest relaxation gap. As we shall see, ψ plays an important role in our analysis and is closely related to divergence measures between probability distributions.

Table 1: Generating function (first column) and its corresponding auxiliary function (second column), optimal detector (third column), and detector risk (fourth column).

ℓ(t)                         | ψ(p)                                      | φ∗                      | 1 − (1/2) inf_φ Φ(φ; P1, P2)
exp(t)                       | 2 sqrt(p(1 − p))                          | log sqrt(p1/p2)         | H²(P1, P2)
log(1 + exp(t)) / log 2      | −(p log p + (1 − p) log(1 − p)) / log 2   | log(p1/p2)              | JS(P1, P2) / log 2
(t + 1)²₊                    | 4 p(1 − p)                                | (p1 − p2)/(p1 + p2)     | Δ(P1, P2)
(t + 1)₊                     | 2 min(p, 1 − p)                           | sgn(p1 − p2)            | TV(P1, P2)

Theorem 1 (Near-optimality of (4)).
For any distributions Q1 and Q2, and any non-empty uncertainty sets P1 and P2, whenever there exists a feasible solution T of problem (1) with objective value less than ε ∈ (0, 1/2), there exists a feasible solution φ of problem (4) with objective value less than ψ(ε).

Theorem 1 guarantees that the approximation (4) of problem (1) is nearly optimal, in the sense that whenever the hypotheses H1, H2 can be decided upon by a test T with risk less than ε, there exists a detector φ with risk less than ψ(ε). This holds regardless of the specification of P1 and P2. Figure 1 illustrates the value of ψ(ε) for ε ∈ (0, 1/2).

Figure 1: ψ(ε) as a function of ε.

The next proposition shows that we can interchange the inf and sup operators. Hence, in order to solve (4), we can first solve the problem of finding the best detector for a simple test (P1, P2), and then find the least favorable distributions that maximize the risk among those best detectors.

Proposition 1. For the Wasserstein uncertainty sets P1, P2 specified in (2), we have

inf_{φ : Ω → R} sup_{P1 ∈ P1, P2 ∈ P2} Φ(φ; P1, P2) = sup_{P1 ∈ P1, P2 ∈ P2} inf_{φ : Ω → R} Φ(φ; P1, P2).

To establish Proposition 1, observe that (i) the sets under the inf and the sup are both infinite-dimensional, (ii) the Wasserstein ball is not compact in the space of probability measures, and (iii) the space of all tests φ is not endowed with a linear topological structure, so there are no readily applicable tools (such as Sion's minimax theorem used in [8]) to justify the interchange of inf and sup.
Our proof strategy is to identify an approximately optimal detector for the sup-inf problem on the left side of (5) using Theorem 3 (whose proof does not depend on Proposition 1), and then verify that it is also an approximately optimal solution of the original inf-sup problem (4). We also note that this issue does not occur in the distributionally robust optimization setting, whose focus is only the inner supremum; the outer infimum in those problems is already finite-dimensional by definition (in fact it corresponds to decision variables in R^n).

The next theorem provides an expression for the optimal detector and its risk.

Theorem 2 (Optimal detector). For any distributions P1 and P2, let dPk/d(P1 + P2) be the Radon-Nikodym derivative of Pk with respect to P1 + P2, k = 1, 2. Then

inf_{φ : Ω → R} Φ(φ; P1, P2) = ∫_Ω ψ( dP1/d(P1 + P2) ) d(P1 + P2).

Define Ω0(P1, P2) := {ω ∈ Ω : 0 < dPk/d(P1 + P2)(ω) < 1, k = 1, 2}. Suppose there exists a well-defined map t∗ : Ω → R such that

t∗(ω) ∈ argmin_{t ∈ R} [ dP1/d(P1 + P2)(ω) ℓ(t) + dP2/d(P1 + P2)(ω) ℓ(−t) ].

Then φ∗(ω) := −t∗(ω) is an optimal detector for the simple test.

Remark 1. By definition, ψ(0) = ψ(1) = 0. Hence the infimum depends only on the values of P1, P2 on Ω0, the subset of Ω on which P1 and P2 are absolutely continuous with respect to each other:

inf_{φ : Ω → R} Φ(φ; P1, P2) = ∫_{Ω0} ψ( dP1/d(P1 + P2) ) d(P1 + P2).

This is intuitive, as we can always find a detector φ whose value is arbitrarily close to zero on Ω \ Ω0.
In particular, if P1 and P2 have measure-zero overlap, then inf_φ Φ(φ; P1, P2) equals zero; that is, the optimal detector for the simple test (P1, P2) has zero risk.

Optimal detector φ∗. Set pk := dPk/d(P1 + P2) on Ω0, k = 1, 2. For the four choices of ℓ listed in Table 1, the optimal detectors φ∗ on Ω0 are listed in the third column, where sgn denotes the sign function. The first one has been considered in [1].

Relation between divergence measures and the risk of the optimal detector. The term ∫_Ω ψ( dP1/d(P1 + P2) ) d(P1 + P2) can be viewed as a "measure of closeness" between probability distributions. Indeed, the fourth column of Table 1 shows that the smallest detector risk for a simple test P1 vs. P2 equals the negative of a divergence between P1 and P2 up to a constant, where H, JS, Δ, and TV denote respectively the Hellinger distance [11], the Jensen-Shannon divergence [20], the triangle discrimination (symmetric χ²-divergence) [28], and the total variation metric [28]. It follows from Theorem 2 that

sup_{P1 ∈ P1, P2 ∈ P2} inf_{φ : Ω → R} Φ(φ; P1, P2) = sup_{P1 ∈ P1, P2 ∈ P2} ∫_Ω ψ( dP1/d(P1 + P2) ) d(P1 + P2).   (5)

The objective on the right-hand side is concave in (P1, P2), since by Theorem 2 it equals an infimum of functions Φ(φ; P1, P2) that are linear in (P1, P2).
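The closed forms of ψ and the risk/divergence connection above can be checked numerically for finitely supported distributions. The sketch below verifies two rows of Table 1 by direct minimization over t, and then checks that for the exponential generating function the smallest detector risk satisfies 1 − (1/2) inf_φ Φ = H²(P1, P2); the distributions p1, p2 are arbitrary made-up examples.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psi(p, loss):
    # auxiliary function: psi(p) = min_t [ p*loss(t) + (1-p)*loss(-t) ]
    return minimize_scalar(lambda t: p * loss(t) + (1 - p) * loss(-t),
                           bounds=(-30, 30), method="bounded").fun

# closed forms from Table 1, checked pointwise
for p in [0.1, 0.3, 0.5, 0.8]:
    assert abs(psi(p, np.exp) - 2 * np.sqrt(p * (1 - p))) < 1e-6          # exp loss
    assert abs(psi(p, lambda t: max(t + 1, 0.0)) - 2 * min(p, 1 - p)) < 1e-6  # hinge

# risk/divergence identity for the exp loss on a finite support:
# inf_phi Phi = sum_w (p1+p2) * psi(p1/(p1+p2)) = 2 * sum_w sqrt(p1*p2),
# so 1 - 0.5 * inf_phi Phi equals the squared Hellinger distance.
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.4, 0.4, 0.2])
s = p1 + p2
risk = np.sum(s * 2 * np.sqrt((p1 / s) * (p2 / s)))
hellinger_sq = 1.0 - np.sum(np.sqrt(p1 * p2))
assert abs((1 - 0.5 * risk) - hellinger_sq) < 1e-12
```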
Problem (5) can be interpreted as finding two distributions P∗1 ∈ P1 and P∗2 ∈ P2 such that the divergence between P∗1 and P∗2 is minimized. This makes sense in that the least favorable distributions (P∗1, P∗2) should be as close to each other as possible in the worst-case hypothesis test scenario.

4 Tractable Reformulation

In this section, we provide a tractable reformulation of (5) by deriving a novel strong duality result. Recall that in our setup we are given two empirical distributions Qk = (1/nk) sum_{i=1}^{nk} δ_{ω̂_i^k}, k = 1, 2. To unify notation, for l = 1, …, n1 + n2, we set

ω^l = ω̂^l_1 for 1 ≤ l ≤ n1, and ω^l = ω̂^{l−n1}_2 for n1 + 1 ≤ l ≤ n1 + n2,

and set Ω̂ := {ω^l : l = 1, …, n1 + n2}.

Theorem 3 (Convex equivalent reformulation). Problem (5) with P1, P2 specified in (2) can be equivalently reformulated as the finite-dimensional convex program

max_{p1, p2 ∈ R₊^{n1+n2}, γ1, γ2 ∈ R₊^{(n1+n2)×(n1+n2)}}   sum_{l=1}^{n1+n2} (p1^l + p2^l) ψ( p1^l / (p1^l + p2^l) )

subject to
sum_{l,m} γk^{lm} ‖ω^l − ω^m‖ ≤ θk, k = 1, 2,
sum_m γ1^{lm} = 1/n1 for 1 ≤ l ≤ n1,  and  sum_m γ1^{lm} = 0 for n1 + 1 ≤ l ≤ n1 + n2,
sum_m γ2^{lm} = 0 for 1 ≤ l ≤ n1,  and  sum_m γ2^{lm} = 1/n2 for n1 + 1 ≤ l ≤ n1 + n2,
sum_l γk^{lm} = pk^m for 1 ≤ m ≤ n1 + n2, k = 1, 2.   (6)

Theorem 3, combined with Proposition 1, indicates that problem (4) is equivalent to problem (6). We next explain the various elements of problem (6).

Decision variables.
pk can be identified with a probability distribution on Ω̂, because sum_l pk^l = 1, and γk can be viewed as a joint probability distribution on Ω̂² with marginal distributions Qk and pk. We can eliminate the variables p1, p2 by substituting pk with γk using the last constraint, so that γ1, γ2 are the only decision variables.

Objective. The objective function is identical to the objective function of (5), and thus we are maximizing a concave function of (p1, p2). If we substitute pk with γk, the objective function is also concave in (γ1, γ2).

Constraints. The constraints are all linear. Note that the ω^l are parameters, not decision variables, so the pairwise distances ‖ω^l − ω^m‖ can be computed before solving the program. Taken together, the constraints are equivalent to the Wasserstein metric constraints W(Qk, pk) ≤ θk.

Strong duality. Problem (6) is a restriction of problem (5), in the sense that the two have the same objective but (6) restricts the feasible region to the subset of distributions supported on the finite set Ω̂. Nevertheless, Theorem 3 guarantees that the two problems have the same optimal value, because there exists a least favorable distribution supported on Ω̂, as explained below.

Intuition on the reformulation. We here provide insight into the structural properties of the least favorable distribution that explain why the reduction in Theorem 3 holds. The complete proof of Theorem 3 can be found in the Appendix. Suppose Qk = δ_{ω̂k}, k = 1, 2, Ω = R^d, and ψ(p) = 2 sqrt(p(1 − p)). Recall that the Wasserstein metric measures the cheapest cost (measured in ‖·‖) of transporting probability mass from one distribution to the other.
Thus, based on the discussion in Section 3, the goal of problem (5) is to move (part of) the probability mass on ω̂1 and ω̂2 such that the negative divergence between the resulting distributions is maximized. The following three key observations demonstrate how to move the probability mass in a least favorable way.

Figure 2: Illustration of the least favorable distribution: it is always better to move the probability mass from ω̂1 and ω̂2 to an identical point ω on the line segment connecting ω̂1 and ω̂2.

(i) Consider feasible solutions of the form

(P1, P2) = ( (1 − p1) δ_{ω̂1} + p1 δ_{ω1}, (1 − p2) δ_{ω̂2} + p2 δ_{ω2} ),   ω1, ω2 ∈ Ω \ {ω̂1, ω̂2}.

Namely, (P1, P2) is obtained by moving probability mass pk > 0 from ω̂k to ωk, k = 1, 2 (see Figure 2). It follows that the objective value is

∫_Ω ψ( dP1/d(P1 + P2) ) d(P1 + P2) = 2 sqrt(p1 p2) if ω1 = ω2, and 0 otherwise.

This is consistent with Remark 1 in that the objective value vanishes if the supports of P1 and P2 do not overlap. Moreover, when ω1 = ω2, the objective value is independent of their common value ω = ω1 = ω2.
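Observation (i) can be verified numerically on a discrete support. A small sketch (the support labels and mass values are made up): with ψ(p) = 2 sqrt(p(1 − p)), the objective over a finite support reduces to the sum 2 sum_w sqrt(P1(w) P2(w)), which is 2 sqrt(p1 p2) when the moved mass lands on a common point and 0 when the supports stay disjoint.

```python
import numpy as np

def objective(masses1, masses2):
    # 2 * sum_w sqrt(P1(w) * P2(w)) over the union of the two supports,
    # i.e. the objective of (5) for psi(p) = 2*sqrt(p*(1-p))
    support = set(masses1) | set(masses2)
    return 2 * sum(np.sqrt(masses1.get(w, 0.0) * masses2.get(w, 0.0))
                   for w in support)

p1, p2 = 0.3, 0.4
# mass p1 and p2 moved to the SAME point "u": overlap of size sqrt(p1*p2)
same = objective({"w1hat": 1 - p1, "u": p1}, {"w2hat": 1 - p2, "u": p2})
# mass moved to two DIFFERENT points "u", "v": supports stay disjoint
diff = objective({"w1hat": 1 - p1, "u": p1}, {"w2hat": 1 - p2, "v": p2})
assert abs(same - 2 * np.sqrt(p1 * p2)) < 1e-12
assert diff == 0.0
```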
Therefore, we should move probability mass out of the sources ω̂1, ω̂2 into some common region, which contains points that receive probability mass from both sources.

(ii) Motivated by (i), we consider solutions of the form

(P1, P2) = ( (1 − p1) δ_{ω̂1} + p1 δ_ω, (1 − p2) δ_{ω̂2} + p2 δ_ω ),   ω ∈ Ω \ {ω̂1, ω̂2},

which have the same objective value 2 sqrt(p1 p2). In order to save budget for the Wasserstein metric constraint, i.e., to minimize the transport cost

p1 ‖ω − ω̂1‖ + p2 ‖ω − ω̂2‖,

by the triangle inequality we should choose ω to lie on the line segment connecting ω̂1 and ω̂2 (see Figure 2).

(iii) Motivated by (ii), we consider solutions of the form

P′k = (1 − pk − p′k) δ_{ω̂k} + pk δ_{ωk} + p′k δ_{ω′k},   k = 1, 2,

where ωk, ω′k are on the line segment connecting ω̂1 and ω̂2, k = 1, 2. The objective value is maximized at ω1 = ω′1 = ω̂2 and ω2 = ω′2 = ω̂1, where it equals 2 sqrt((p1 + p′1)(1 − p2 − p′2)) + 2 sqrt((1 − p1 − p′1)(p2 + p′2)). Hence it is better to move probability mass from ω̂1 to ω̂2 and from ω̂2 to ω̂1.

Therefore, we conclude that there exists a least favorable distribution supported on Ω̂. The argument above uses Theorem 2, the triangle inequality for the norm, and the concavity of the auxiliary function ψ. The complete proof can be viewed as a generalization of this argument to the infinitesimal setting.

Complexity.
Problem (6) is a convex program that maximizes a concave function subject to linear constraints. We briefly comment on the complexity of solving (6) in terms of the dimension of the observation space and the sample sizes:

(i) The complexity of (6) is independent of the dimension d of Ω, since we only need to compute the pairwise distances ‖ω^l − ω^m‖ as input to the convex program.

(ii) The complexity in terms of the sample sizes n1, n2 depends on the objective function, and can be nearly sample-size-independent when the objective function is Lipschitz in the ℓ1 norm (equivalently, when the ℓ∞ norm of the partial derivatives is bounded). The reason is as follows. In this case, after eliminating the variables p1, p2, we end up with a convex program involving only γ1, γ2, and the Lipschitz constant of the objective with respect to γ is identical to that with respect to p. Observe that the feasible region of each γk is a subset of the ℓ1-ball in R₊^{(n1+n2)×(n1+n2)}. Then, according to the complexity theory of first-order methods for convex optimization [3], when the objective function is Lipschitz in the ℓ1 norm, the complexity is O(ln(n1) + ln(n2)). This holds for all but the first case in Table 1, so the condition is quite general.

We finally remark that extending previous strong duality results for DRO [4, 6, 7] from one Wasserstein ball to two Wasserstein balls does not lead to an immediately tractable (convex) reformulation. For one thing, simply applying those previous results to the inner supremum in (4) does not work, because after doing so we are left with the outer infimum, which is still intractable.
For another thing, applying the previous methodology to problem (5) does not lead to a tractable reformulation either, mainly because the objective function $\int_\Omega \psi\big(\tfrac{dP_1}{d(P_1+P_2)}\big)\, d(P_1+P_2)$ is not separable in $P_1$ and $P_2$, but depends on the density on the common support of $P_1$ and $P_2$. Thus, as argued in Section 4, in the least favorable distribution the probability mass of the two distributions cannot be transported arbitrarily, but is linked via their common support. In contrast, the problems in DRO [4, 6, 7] have no such linking constraints, which makes it hard to extend the previous methodology. Instead, we prove strong duality from scratch and provide new insights into the structural properties of the least favorable distribution that are different in nature from those in DRO settings.

5 Numerical Experiments

In this section, we demonstrate the performance of our robust detector using real data for human activity detection. We adopt a dataset released by the Wireless Sensor Data Mining (WISDM) Lab in October 2013. The data in this set were collected with the Actitracker system, which is described in [19, 29, 15]. A large number of users carried an Android-based smartphone while performing various everyday activities. These subjects carried the Android phone in their pocket and were asked to walk, jog, ascend stairs, descend stairs, sit, and stand for specific periods of time.

The data collection was controlled by an application executed on the phone. This application is able to record the user's name, start and stop the data collection, and label the activity being performed. In all cases, the accelerometer data are collected every 50 ms, so there are 20 samples per second. There are 2,980,765 recorded time series in total. The activity recognition task involves mapping time-series accelerometer data to a single physical user activity [29].
Our goal is to detect the change of activity in real time from sequential observations. Since it is hard to model the distributions of the various activities, traditional parametric methods do not work well in this case.

For each person, the recorded time series contains the acceleration of the sensor in three directions. In this setting, every $\omega^l$ is a three-dimensional vector $(a^l_x, a^l_y, a^l_z)$. We set $\theta_1 = \theta_2 = \theta$ as the sample sizes are identical, and $\theta$ is chosen such that the quantity $1 - \tfrac{1}{2}\inf_\phi \Phi(\phi; P_1^*, P_2^*)$ in Table 1, or equivalently, the divergence between $P_1^*$ and $P_2^*$, is close to zero with high probability if $Q_1$ and $Q_2$ are bootstrapped from the data before the change, where $(P_1^*, P_2^*)$ is the pair of LFDs obtained from (6). The intuition is that we want the Wasserstein ball to be large enough to avoid false detection while still having separable hypotheses (so that the problem is well-defined).

We compare our robust detector, coupled with the CUSUM detector using a scheme similar to [5], with the Hotelling T² control chart, which is a traditional way to detect mean and covariance changes in the multivariate case. The Hotelling control chart plots the following quantity [21]:
$$T^2 = (x - \mu)' \Sigma^{-1} (x - \mu),$$
where $\mu$ and $\Sigma$ are the sample mean and sample covariance obtained from training data.

As shown in Fig. 3(a), in many cases the Hotelling T² chart fails to detect the change, while our method performs well. This is as expected, since the change is hard to capture via the mean and covariance, as the Hotelling chart does.

Moreover, we further test the proposed robust detectors, $\phi^* = \tfrac{1}{2}\ln(p_1^*/p_2^*)$ and $\phi^* = \mathrm{sgn}(p_1^* - p_2^*)$, on 100 sequences of data. Here $p_1^*$ and $p_2^*$ are the LFDs computed from the optimization problem (6). For each sequence, we choose the threshold for detection by controlling the type-I error.
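As a concrete reference point for the baseline above, here is a minimal sketch of computing the Hotelling T² statistic from training data. It assumes NumPy; the function name and variable names are ours, not from the paper.

```python
import numpy as np

def hotelling_t2(x, train):
    """Hotelling T^2 statistic: T^2 = (x - mu)' Sigma^{-1} (x - mu),
    where mu and Sigma are the sample mean and sample covariance
    estimated from the training data (rows of `train` are observations)."""
    mu = train.mean(axis=0)
    sigma = np.cov(train, rowvar=False)  # unbiased sample covariance
    d = x - mu
    # Solve Sigma z = d instead of forming the explicit inverse.
    return float(d @ np.linalg.solve(sigma, d))
```

A change would then be declared once T² exceeds a threshold, calibrated (as for our detector) to a target type-I error.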
Then we compare the average detection delay of the robust detector and the Hotelling T² control chart, as shown in Fig. 3(b). The robust detector has a clear advantage, and $\phi^* = \mathrm{sgn}(p_1^* - p_2^*)$ indeed has better performance than $\phi^* = \tfrac{1}{2}\ln(p_1^*/p_2^*)$, consistent with our theoretical finding.

Figure 3: Comparison of the detector $\phi^* = \tfrac{1}{2}\ln(p_1^*/p_2^*)$ with the Hotelling control chart. (a) Upper: the proposed optimal detector; Middle: the Hotelling T² control chart; Lower: the raw data; here we plot $(a_x^2 + a_y^2 + a_z^2)^{1/2}$ for simple illustration. The dataset is a portion of the full observations from the person indexed by 1679, with pre-change activity jogging and post-change activity walking. The black dotted line at index 4589 indicates the boundary between the pre-change and post-change regimes. (b) The average detection delay vs. type-I error. The average is taken over 100 sequences of data.

6 Conclusion

In this paper, we propose a data-driven, distribution-free framework for robust hypothesis testing based on the Wasserstein metric. We develop a computationally efficient reformulation of the minimax problem which renders a nearly-optimal detector. The framework is readily extended to multiple hypotheses and sequential settings. The approach can also be extended to other settings, such as constraining the Type-I error to be below a certain threshold (as in the typical statistical practice of choosing the size or significance level of the test), or minimizing a weighted combination of the Type-I and Type-II errors.
In the future, we will study the optimal selection of the size of the uncertainty sets, leveraging tools from distributionally robust optimization, and test the performance of our framework on large-scale instances.

References

[1] Alexander Goldenshluger, Anatoli Juditsky, and Arkadi Nemirovski. Hypothesis testing by convex optimization. Electron. J. Statist., 9(2):1645–1712, 2015.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[3] Ahron Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications, volume 2. SIAM, 2001.

[4] Jose Blanchet and Karthyek R. A. Murthy. Quantifying distributional model risk via optimal transport. arXiv preprint arXiv:1604.01446, 2016.

[5] Yang Cao and Yao Xie. Robust sequential change-point detection by convex optimization. In International Symposium on Information Theory (ISIT), 2017.

[6] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, pages 1–52, 2015.

[7] Rui Gao and Anton J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.

[8] Alexander Goldenshluger, Anatoli Juditsky, Arkadi Nemirovski, et al. Hypothesis testing by convex optimization. Electronic Journal of Statistics, 9(2):1645–1712, 2015.

[9] Gokhan Gul and Abdelhak M. Zoubir. Minimax robust hypothesis testing.
IEEE Transactions on Information Theory, 63(9):5572–5587, 2017.

[10] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.

[11] Ernst Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. Journal für die reine und angewandte Mathematik, 136:210–271, 1909.

[12] Peter J. Huber. Robust Statistics. Wiley, 1981.

[13] Peter J. Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6):1753–1758, 1965.

[14] Anatoli Juditsky and Arkadi Nemirovski. Hypothesis testing via affine detectors. Electron. J. Statist., 10(2):2204–2242, 2016.

[15] Jennifer R. Kwapisz, Gary M. Weiss, and Samuel A. Moore. Activity recognition using cell phone accelerometers. ACM SIGKDD Explorations Newsletter, 12(2):74–82, 2011.

[16] Elizaveta Levina and Peter Bickel. The earth mover's distance is the Mallows distance: Some insights from statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 2, pages 251–256. IEEE, 2001.

[17] B. C. Levy. Robust hypothesis testing with a relative entropy tolerance. IEEE Transactions on Information Theory, 55(1):413–421, 2009.

[18] Bernard C. Levy. Principles of Signal Detection and Parameter Estimation. Springer Science & Business Media, 2008.

[19] Jeffrey W. Lockhart, Gary M. Weiss, Jack C. Xue, Shaun T. Gallagher, Andrew B. Grosner, and Tony T. Pulickal. Design considerations for the WISDM smart phone-based sensor mining architecture. In Proceedings of the Fifth International Workshop on Knowledge Discovery from Sensor Data, pages 25–33. ACM, 2011.

[20] Christopher D. Manning and Hinrich Schütze.
Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[21] Douglas C. Montgomery. Introduction to Statistical Quality Control. John Wiley & Sons, New York, 2009.

[22] Arkadi Nemirovski and Alexander Shapiro. Convex approximations of chance constrained programs. SIAM Journal on Optimization, 17(4):969–996, 2006.

[23] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446. Springer, 2011.

[24] Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.

[25] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[26] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pages 1576–1584, 2015.

[27] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

[28] Flemming Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.

[29] Gary M. Weiss and Jeffrey W. Lockhart. The impact of personalization on smartphone-based activity recognition.
In AAAI Workshop on Activity Context Representation: Techniques and Languages, pages 98–104, 2012.