{"title": "Distributionally Robust Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1576, "page_last": 1584, "abstract": "This paper proposes a distributionally robust approach to logistic regression. We use the Wasserstein distance to construct a ball in the space of probability distributions centered at the uniform distribution on the training samples. If the radius of this Wasserstein ball is chosen judiciously, we can guarantee that it contains the unknown data-generating distribution with high confidence. We then formulate a distributionally robust logistic regression model that minimizes a worst-case expected logloss function, where the worst case is taken over all distributions in the Wasserstein ball. We prove that this optimization problem admits a tractable reformulation and encapsulates the classical as well as the popular regularized logistic regression problems as special cases. We further propose a distributionally robust approach based on Wasserstein balls to compute upper and lower confidence bounds on the misclassification probability of the resulting classifier. These bounds are given by the optimal values of two highly tractable linear programs. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.", "full_text": "Distributionally Robust Logistic Regression\n\nSoroosh Sha\ufb01eezadeh-Abadeh\n\nPeyman Mohajerin Esfahani\n\nDaniel Kuhn\n\n\u00b4Ecole Polytechnique F\u00b4ed\u00b4erale de Lausanne, CH-1015 Lausanne, Switzerland\n\n{soroosh.shafiee,peyman.mohajerin,daniel.kuhn} @epfl.ch\n\nAbstract\n\nThis paper proposes a distributionally robust approach to logistic regression. We\nuse the Wasserstein distance to construct a ball in the space of probability distribu-\ntions centered at the uniform distribution on the training samples. 
If the radius of this ball is chosen judiciously, we can guarantee that it contains the unknown data-generating distribution with high confidence. We then formulate a distributionally robust logistic regression model that minimizes a worst-case expected logloss function, where the worst case is taken over all distributions in the Wasserstein ball. We prove that this optimization problem admits a tractable reformulation and encapsulates the classical as well as the popular regularized logistic regression problems as special cases. We further propose a distributionally robust approach based on Wasserstein balls to compute upper and lower confidence bounds on the misclassification probability of the resulting classifier. These bounds are given by the optimal values of two highly tractable linear programs. We validate our theoretical out-of-sample guarantees through simulated and empirical experiments.\n\n1 Introduction\n\nLogistic regression is one of the most frequently used classification methods [1]. Its objective is to establish a probabilistic relationship between a continuous feature vector and a binary explanatory variable. However, in spite of its overwhelming success in machine learning, data analytics, medicine, and other fields, logistic regression models can display a poor out-of-sample performance if training data is sparse. In this case modelers often resort to ad hoc regularization techniques in order to combat overfitting effects. This paper aims to develop new regularization techniques for logistic regression—and to provide intuitive probabilistic interpretations for existing ones—by using tools from modern distributionally robust optimization.\n\nLogistic Regression: Let x ∈ R^n denote a feature vector and y ∈ {−1, +1} the associated binary label to be predicted. 
In logistic regression, the conditional distribution of y given x is modeled as\n\nProb(y|x) = [1 + exp(−y⟨β, x⟩)]^{−1},    (1)\n\nwhere the weight vector β ∈ R^n constitutes an unknown regression parameter. Suppose that N training samples {(x̂_i, ŷ_i)}_{i=1}^N have been observed. Then the maximum likelihood estimator of classical logistic regression is found by solving the geometric program\n\nmin_β (1/N) Σ_{i=1}^N l_β(x̂_i, ŷ_i),    (2)\n\nwhose objective function is given by the sample average of the logloss function l_β(x, y) = log(1 + exp(−y⟨β, x⟩)). It has been observed, however, that the resulting maximum likelihood estimator may display a poor out-of-sample performance. Indeed, it is well documented that minimizing the average logloss function leads to overfitting and weak classification performance [2, 3]. In order to overcome this deficiency, it has been proposed to modify the objective function of problem (2) [4, 5, 6]. An alternative approach is to add a regularization term to the logloss function in order to mitigate overfitting. These regularization techniques lead to the modified optimization problem\n\nmin_β (1/N) Σ_{i=1}^N l_β(x̂_i, ŷ_i) + εR(β),    (3)\n\nwhere R(β) and ε denote the regularization function and the associated coefficient, respectively. A popular choice for the regularization term is R(β) = ‖β‖, where ‖·‖ denotes a generic norm such as the ℓ1- or the ℓ2-norm. The use of ℓ1-regularization tends to induce sparsity in β, which in turn helps to combat overfitting effects [7]. Moreover, ℓ1-regularized logistic regression serves as an effective means for feature selection. 
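Problems (2) and (3) can be prototyped in a few lines. The sketch below is our own illustration with synthetic data (the paper's experiments use MATLAB/YALMIP); it fits classical and ℓ2-regularized logistic regression with scipy:

```python
import numpy as np
from scipy.optimize import minimize

def logloss(beta, X, y):
    # sample-average logloss of (2); log(1 + exp(-y*<beta, x>)) is
    # evaluated stably as logaddexp(0, -y*<beta, x>)
    return np.mean(np.logaddexp(0.0, -y * (X @ beta)))

def fit_logistic(X, y, eps=0.0):
    # eps = 0 recovers the maximum likelihood problem (2);
    # eps > 0 adds the regularizer eps*||beta||_2, an instance of (3)
    obj = lambda b: logloss(b, X, y) + eps * np.linalg.norm(b)
    start = 0.01 * np.ones(X.shape[1])  # start away from the l2-norm kink at 0
    return minimize(obj, start, method="BFGS").x

# synthetic data drawn from the model (1) with a hypothetical true beta
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
beta_true = np.array([2.0, 0.0, 0.0])
y = np.where(rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ beta_true))), 1.0, -1.0)

beta_mle = fit_logistic(X, y)            # problem (2)
beta_reg = fit_logistic(X, y, eps=0.1)   # problem (3)
```

The regularized solution has a (weakly) smaller norm than the maximum likelihood estimator, which is the shrinkage effect behind the overfitting discussion above.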
It is further shown in [8] that ℓ1-regularization outperforms ℓ2-regularization when the number of training samples is smaller than the number of features. On the downside, ℓ1-regularization leads to non-smooth optimization problems, which are more challenging. Algorithms for large-scale regularized logistic regression are discussed in [9, 10, 11, 12].\n\nDistributionally Robust Optimization: Regression and classification problems are typically modeled as optimization problems under uncertainty. To date, optimization under uncertainty has been addressed by several complementary modeling paradigms that differ mainly in the representation of uncertainty. For instance, stochastic programming assumes that the uncertainty is governed by a known probability distribution and aims to minimize a probability functional such as the expected cost or a quantile of the cost distribution [13, 14]. In contrast, robust optimization ignores all distributional information and aims to minimize the worst-case cost under all possible uncertainty realizations [15, 16, 17]. While stochastic programs may rely on distributional information that is not available or hard to acquire in practice, robust optimization models may adopt an overly pessimistic view of the uncertainty and thereby promote over-conservative decisions.\n\nThe emerging field of distributionally robust optimization aims to bridge the gap between the conservatism of robust optimization and the specificity of stochastic programming: it seeks to minimize a worst-case probability functional (e.g., the worst-case expectation), where the worst case is taken with respect to an ambiguity set, that is, a family of distributions consistent with the given prior information on the uncertainty. The vast majority of the existing literature focuses on ambiguity sets characterized through moment and support information, see e.g. [18, 19, 20]. 
However, ambiguity sets can also be constructed via distance measures in the space of probability distributions, such as the Prohorov metric [21] or the Kullback-Leibler divergence [22]. Due to its attractive measure concentration properties, we use here the Wasserstein metric to construct ambiguity sets.\n\nContribution: In this paper we propose a distributionally robust perspective on logistic regression. Our research is motivated by the well-known observation that regularization techniques can improve the out-of-sample performance of many classifiers. In the context of support vector machines and Lasso, there have been several recent attempts to give ad hoc regularization techniques a robustness interpretation [23, 24]. However, to the best of our knowledge, no such connection has been established for logistic regression. In this paper we aim to close this gap by adopting a new distributionally robust optimization paradigm based on Wasserstein ambiguity sets [25]. Starting from a data-driven distributionally robust statistical learning setup, we will derive a family of regularized logistic regression models that admit an intuitive probabilistic interpretation and encapsulate the classical regularized logistic regression (3) as a special case. Moreover, by invoking recent measure concentration results, our proposed approach provides a probabilistic guarantee for the emerging regularized classifiers, which seems to be the first result of this type. All proofs are relegated to the technical appendix. We summarize our main contributions as follows:\n\n• Distributionally robust logistic regression model and tractable reformulation: We propose a data-driven distributionally robust logistic regression model based on an ambiguity set induced by the Wasserstein distance. 
We prove that the resulting semi-infinite optimization problem admits an equivalent reformulation as a tractable convex program.\n\n• Risk estimation: Using similar distributionally robust optimization techniques based on the Wasserstein ambiguity set, we develop two highly tractable linear programs whose optimal values provide confidence bounds on the misclassification probability or risk of the emerging classifiers.\n\n• Out-of-sample performance guarantees: Adopting a distributionally robust framework allows us to invoke results from the measure concentration literature to derive finite-sample probabilistic guarantees. Specifically, we establish out-of-sample performance guarantees for the classifiers obtained from the proposed distributionally robust optimization model.\n\n• Probabilistic interpretation of existing regularization techniques: We show that the standard regularized logistic regression is a special case of our framework. In particular, we show that the regularization coefficient ε in (3) can be interpreted as the size of the ambiguity set underlying our distributionally robust optimization model.\n\n2 A distributionally robust perspective on statistical learning\n\nIn the standard statistical learning setting all training and test samples are drawn independently from some distribution P supported on Ξ = R^n × {−1, +1}. If the distribution P were known, the best weight parameter β could be found by solving the stochastic optimization problem\n\ninf_β { E_P[l_β(x, y)] = ∫_{R^n×{−1,+1}} l_β(x, y) P(d(x, y)) }.    (4)\n\nIn practice, however, P is only indirectly observable through N independent training samples. Thus, the distribution P is itself uncertain, which motivates us to address problem (4) from a distributionally robust perspective. 
This means that we use the training samples to construct an ambiguity set P, that is, a family of distributions that contains the unknown distribution P with high confidence. Then we solve the distributionally robust optimization problem\n\ninf_β sup_{Q∈P} E_Q[l_β(x, y)],    (5)\n\nwhich minimizes the worst-case expected logloss function. The construction of the ambiguity set P should be guided by the following principles. (i) Tractability: It must be possible to solve the distributionally robust optimization problem (5) efficiently. (ii) Reliability: The optimizer of (5) should be near-optimal in (4), thus facilitating attractive out-of-sample guarantees. (iii) Asymptotic consistency: For large training data sets, the solution of (5) should converge to the one of (4). In this paper we propose to use the Wasserstein metric to construct P as a ball in the space of probability distributions that satisfies (i)–(iii).\n\nDefinition 1 (Wasserstein Distance). Let M(Ξ²) denote the set of probability distributions on Ξ × Ξ. The Wasserstein distance between two distributions P and Q supported on Ξ is defined as\n\nW(Q, P) := inf_{Π∈M(Ξ²)} { ∫_{Ξ²} d(ξ, ξ′) Π(dξ, dξ′) : Π(dξ, Ξ) = Q(dξ), Π(Ξ, dξ′) = P(dξ′) },\n\nwhere ξ = (x, y) and d(ξ, ξ′) is a metric on Ξ.\n\nThe Wasserstein distance represents the minimum cost of moving the distribution P to the distribution Q, where the cost of moving a unit mass from ξ to ξ′ amounts to d(ξ, ξ′). In the remainder, we denote by B_ε(P) := {Q : W(Q, P) ≤ ε} the ball of radius ε centered at P with respect to the Wasserstein distance. In this paper we propose to use Wasserstein balls as ambiguity sets. 
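For two finitely supported distributions, the infimum in Definition 1 is an ordinary transportation linear program over the coupling Π. A minimal sketch (our own illustration, with an assumed ground-metric matrix) using scipy:

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(p, q, D):
    # Wasserstein distance between discrete distributions p (len m) and
    # q (len k) with ground-metric matrix D (m x k):
    #   min <D, Pi>  s.t.  Pi @ 1 = p,  Pi.T @ 1 = q,  Pi >= 0
    m, k = D.shape
    A_eq, b_eq = [], []
    for i in range(m):                       # row marginals equal p
        row = np.zeros((m, k)); row[i, :] = 1.0
        A_eq.append(row.ravel()); b_eq.append(p[i])
    for j in range(k):                       # column marginals equal q
        col = np.zeros((m, k)); col[:, j] = 1.0
        A_eq.append(col.ravel()); b_eq.append(q[j])
    res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# two point masses at ground distance 3: moving all mass costs 3
D = np.array([[0.0, 3.0], [3.0, 0.0]])
print(wasserstein_discrete(np.array([1.0, 0.0]), np.array([0.0, 1.0]), D))
```

This brute-force LP is only meant to make the definition concrete; it scales poorly in the number of support points.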
Given the training data points {(x̂_i, ŷ_i)}_{i=1}^N, a natural candidate for the center of the Wasserstein ball is the empirical distribution P̂_N = (1/N) Σ_{i=1}^N δ_{(x̂_i, ŷ_i)}, where δ_{(x̂_i, ŷ_i)} denotes the Dirac point measure at (x̂_i, ŷ_i). Thus, we henceforth examine the distributionally robust optimization problem\n\ninf_β sup_{Q∈B_ε(P̂_N)} E_Q[l_β(x, y)]    (6)\n\nequipped with a Wasserstein ambiguity set. Note that (6) reduces to the average logloss minimization problem (2) associated with classical logistic regression if we set ε = 0.\n\n3 Tractable reformulation and probabilistic guarantees\n\nIn this section we demonstrate that (6) can be reformulated as a tractable convex program and establish probabilistic guarantees for its optimal solutions.\n\n3.1 Tractable reformulation\n\nWe first define a metric on the feature-label space, which will be used in the remainder.\n\nDefinition 2 (Metric on the Feature-Label Space). The distance between two data points (x, y), (x′, y′) ∈ Ξ is defined as d((x, y), (x′, y′)) = ‖x − x′‖ + κ|y − y′|/2, where ‖·‖ is any norm on R^n and κ is a positive weight.\n\nThe parameter κ in Definition 2 represents the relative emphasis between feature mismatch and label uncertainty. The following theorem presents a tractable reformulation of the distributionally robust optimization problem (6) and thus constitutes the first main result of this paper.\n\nTheorem 1 (Tractable Reformulation). 
The optimization problem (6) is equivalent to\n\nĴ := inf_β sup_{Q∈B_ε(P̂_N)} E_Q[l_β(x, y)] = min_{β,λ,s_i} λε + (1/N) Σ_{i=1}^N s_i\n    s.t. l_β(x̂_i, ŷ_i) ≤ s_i    ∀i ≤ N\n         l_β(x̂_i, −ŷ_i) − λκ ≤ s_i    ∀i ≤ N\n         ‖β‖_* ≤ λ.    (7)\n\nNote that (7) constitutes a tractable convex program for most commonly used norms ‖·‖.\n\nRemark 1 (Regularized Logistic Regression). As the parameter κ > 0 characterizing the metric d(·,·) tends to infinity, the second constraint group in the convex program (7) becomes redundant. Hence, (7) reduces to the celebrated regularized logistic regression problem\n\ninf_β ε‖β‖_* + (1/N) Σ_{i=1}^N l_β(x̂_i, ŷ_i),\n\nwhere the regularization function is determined by the dual norm on the feature space, while the regularization coefficient coincides with the radius of the Wasserstein ball. Note that for κ = ∞ the Wasserstein distance between two distributions is infinite if they assign different labels to a fixed feature vector with positive probability. Any distribution in B_ε(P̂_N) must then have non-overlapping conditional supports for y = +1 and y = −1. Thus, setting κ = ∞ reflects the belief that the label is a (deterministic) function of the feature and that label measurements are exact. 
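Problem (7) can be prototyped by eliminating the s_i (their optimal value is s_i = max{l_β(x̂_i, ŷ_i), l_β(x̂_i, −ŷ_i) − λκ}) and minimizing over (β, λ) subject to ‖β‖_* ≤ λ. The following sketch assumes the self-dual ℓ2-norm and synthetic data; the general-purpose SLSQP call is only a heuristic for this nonsmooth problem, not the convex-programming approach the paper uses:

```python
import numpy as np
from scipy.optimize import minimize

def drlr_objective(z, X, y, eps, kappa):
    # objective of (7) after eliminating s_i:
    # s_i = max{ l_beta(x_i, y_i), l_beta(x_i, -y_i) - lam*kappa }
    beta, lam = z[:-1], z[-1]
    margin = y * (X @ beta)
    loss_true = np.logaddexp(0.0, -margin)   # l_beta(x_i, y_i)
    loss_flip = np.logaddexp(0.0, margin)    # l_beta(x_i, -y_i)
    s = np.maximum(loss_true, loss_flip - lam * kappa)
    return lam * eps + s.mean()

def fit_drlr(X, y, eps=0.05, kappa=1.0):
    # ||beta||_2 <= lam encodes the dual-norm constraint ||beta||_* <= lam
    # for the self-dual l2-norm
    cons = {"type": "ineq", "fun": lambda z: z[-1] - np.linalg.norm(z[:-1])}
    z0 = np.zeros(X.shape[1] + 1)            # beta = 0, lam = 0 is feasible
    res = minimize(drlr_objective, z0, args=(X, y, eps, kappa),
                   method="SLSQP", constraints=[cons])
    return res.x[:-1], res.x[-1], res.fun

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(rng.random(200) < 1.0 / (1.0 + np.exp(-3.0 * X[:, 0])), 1.0, -1.0)
beta_hat, lam_hat, J_hat = fit_drlr(X, y)
```

At β = 0, λ = 0 the objective equals log 2, so any sensible fit should do at least that well; with informative data it does strictly better.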
As this belief is not tenable in most applications, an approach with κ < ∞ may be more satisfying.\n\n3.2 Out-of-sample performance guarantees\n\nWe now exploit a recent measure concentration result characterizing the speed at which P̂_N converges to P with respect to the Wasserstein distance [26] in order to derive out-of-sample performance guarantees for distributionally robust logistic regression.\n\nIn the following, we let Ξ̂_N := {(x̂_i, ŷ_i)}_{i=1}^N be a set of N independent training samples from P, and we denote by β̂, λ̂, and ŝ_i the optimal solutions and by Ĵ the corresponding optimal value of (7). Note that these values are random objects as they depend on the random training data Ξ̂_N.\n\nTheorem 2 (Out-of-Sample Performance). Assume that the distribution P is light-tailed, i.e., there is a > 1 with A := E_P[exp(‖2x‖^a)] < +∞. If the radius ε of the Wasserstein ball is set to\n\nε_N(η) = (log(c_1 η^{−1})/(c_2 N))^{1/a} · 1{N < log(c_1 η^{−1})/(c_2 c_3)} + (log(c_1 η^{−1})/(c_2 N))^{1/n} · 1{N ≥ log(c_1 η^{−1})/(c_2 c_3)},    (8)\n\nthen we have P^N{P ∈ B_ε(P̂_N)} ≥ 1 − η, implying that P^N{Ξ̂_N : E_P[l_β̂(x, y)] ≤ Ĵ} ≥ 1 − η for all sample sizes N ≥ 1 and confidence levels η ∈ (0, 1]. Moreover, the positive constants c_1, c_2, and c_3 appearing in (8) depend only on the light-tail parameters a and A, the dimension n of the feature space, and the metric on the feature-label space.\n\nRemark 2 (Worst-Case Loss). Denoting the empirical logloss on the training set Ξ̂_N by E_P̂_N[l_β̂(x, y)], the worst-case loss Ĵ can be expressed as\n\nĴ = λ̂ε + E_P̂_N[l_β̂(x, y)] + (1/N) Σ_{i=1}^N max{0, ŷ_i⟨β̂, x̂_i⟩ − λ̂κ}.    (9)\n\nNote that the last term in (9) can be viewed as a complementary regularization term that does not appear in standard regularized logistic regression. This term accounts for label uncertainty and decreases with κ. Thus, κ can be interpreted as our trust in the labels of the training samples. Note that this regularization term vanishes for κ → ∞. One can further prove that λ̂ converges to ‖β̂‖_* for κ → ∞, implying that (9) reduces to the standard regularized logistic regression objective in this limit.\n\nRemark 3 (Performance Guarantees). The following comments are in order:\n\nI. Light-Tail Assumption: The light-tail assumption of Theorem 2 is restrictive but seems to be unavoidable for any a priori guarantees of the type described in Theorem 2. Note that this assumption is automatically satisfied if the features have bounded support or if they are known to follow, for instance, a Gaussian or exponential distribution.\n\nII. 
Asymptotic Consistency: For any fixed confidence level η, the radius ε_N(η) defined in (8) drops to zero as the sample size N increases, and thus the ambiguity set shrinks to a singleton. To be more precise, with probability 1 across all training datasets, any sequence of distributions in the ambiguity set with radius ε_N(η) converges in the Wasserstein metric, and thus weakly, to the unknown data-generating distribution P; see [25, Corollary 3.4] for a formal proof. Consequently, the solution of (6) can be shown to converge to the solution of (4) as N increases.\n\nIII. Finite Sample Behavior: The a priori bound (8) on the size of the Wasserstein ball has two growth regimes. For small N, the radius decreases as N^{−1/a}, and for large N it scales with N^{−1/n}, where n is the dimension of the feature space. We refer to [26, Section 1.3] for further details on the optimality of these rates and potential improvements for special cases. Note that when the support of the underlying distribution P is bounded or P has a Gaussian distribution, the parameter a can be effectively set to 1.\n\n3.3 Risk Estimation: Worst- and Best-Cases\n\nOne of the main objectives in logistic regression is to control the classification performance. Specifically, we are interested in predicting labels from features. This can be achieved via a classifier function f_β : R^n → {+1, −1}, whose risk R(β) := P[y ≠ f_β(x)] represents the misclassification probability. In logistic regression, a natural choice for the classifier is f_β(x) = +1 if Prob(+1|x) > 0.5 and f_β(x) = −1 otherwise, where the conditional probability Prob(y|x) is defined in (1). The risk associated with this classifier can be expressed as R(β) = E_P[1{y⟨β,x⟩≤0}]. 
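Theorem 3 below shows that the worst-case risk over the Wasserstein ball is the optimal value of a linear program. The following sketch builds that LP, (10a), with scipy.optimize.linprog, assuming the self-dual ℓ2-norm on features and tiny hypothetical data:

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_risk(beta_hat, X, y, eps, kappa):
    # LP (10a): worst-case misclassification probability over the
    # Wasserstein ball; decision variables ordered as [lam, s, r, t]
    N = X.shape[0]
    a = y * (X @ beta_hat)           # margins y_i * <beta_hat, x_i>
    d = np.linalg.norm(beta_hat)     # dual norm of beta_hat (l2 is self-dual)
    c = np.concatenate(([eps], np.full(N, 1.0 / N), np.zeros(2 * N)))
    A_ub, b_ub = [], []
    for i in range(N):
        # 1 - r_i a_i <= s_i   <=>   -s_i - a_i r_i <= -1
        row = np.zeros(1 + 3 * N); row[1 + i] = -1.0; row[1 + N + i] = -a[i]
        A_ub.append(row); b_ub.append(-1.0)
        # 1 + t_i a_i - lam*kappa <= s_i   <=>   -kappa*lam - s_i + a_i t_i <= -1
        row = np.zeros(1 + 3 * N); row[0] = -kappa; row[1 + i] = -1.0
        row[1 + 2 * N + i] = a[i]
        A_ub.append(row); b_ub.append(-1.0)
        # r_i * d <= lam   and   t_i * d <= lam
        row = np.zeros(1 + 3 * N); row[0] = -1.0; row[1 + N + i] = d
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(1 + 3 * N); row[0] = -1.0; row[1 + 2 * N + i] = d
        A_ub.append(row); b_ub.append(0.0)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=(0, None), method="highs")
    return res.fun

# sanity check: for eps = 0 the LP collapses to the empirical risk
X = np.array([[2.0], [-1.0]]); y = np.array([1.0, 1.0]); beta = np.array([1.0])
print(worst_case_risk(beta, X, y, eps=0.0, kappa=1.0))  # → 0.5
```

With ε = 0 one of the two samples is misclassified, so the LP returns 1/2; enlarging the ball can only increase the worst-case risk.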
As in Section 3.1, we can use worst- and best-case expectations over Wasserstein balls to construct confidence bounds on the risk.\n\nTheorem 3 (Risk Estimation). For any β̂ depending on the training dataset {(x̂_i, ŷ_i)}_{i=1}^N we have:\n\n(i) The worst-case risk Rmax(β̂) := sup_{Q∈B_ε(P̂_N)} E_Q[1{y⟨β̂,x⟩≤0}] is given by\n\nRmax(β̂) = min_{λ,s_i,r_i,t_i} λε + (1/N) Σ_{i=1}^N s_i\n    s.t. 1 − r_i ŷ_i⟨β̂, x̂_i⟩ ≤ s_i    ∀i ≤ N\n         1 + t_i ŷ_i⟨β̂, x̂_i⟩ − λκ ≤ s_i    ∀i ≤ N\n         r_i‖β̂‖_* ≤ λ,  t_i‖β̂‖_* ≤ λ    ∀i ≤ N\n         r_i, t_i, s_i ≥ 0    ∀i ≤ N.    (10a)\n\nIf the Wasserstein radius ε is set to ε_N(η) as defined in (8), then Rmax(β̂) ≥ R(β̂) with probability 1 − η across all training sets {(x̂_i, ŷ_i)}_{i=1}^N.\n\n(ii) Similarly, the best-case risk Rmin(β̂) := inf_{Q∈B_ε(P̂_N)} E_Q[1{y⟨β̂,x⟩<0}] is given by\n\nRmin(β̂) = 1 − min_{λ,s_i,r_i,t_i} λε + (1/N) Σ_{i=1}^N s_i\n    s.t. 1 + r_i ŷ_i⟨β̂, x̂_i⟩ ≤ s_i    ∀i ≤ N\n         1 − t_i ŷ_i⟨β̂, x̂_i⟩ − λκ ≤ s_i    ∀i ≤ N\n         r_i‖β̂‖_* ≤ λ,  t_i‖β̂‖_* ≤ λ    ∀i ≤ N\n         r_i, t_i, s_i ≥ 0    ∀i ≤ N.    (10b)\n\nIf the Wasserstein radius ε is set to ε_N(η) as defined in (8), then Rmin(β̂) ≤ R(β̂) with probability 1 − η across all training sets {(x̂_i, ŷ_i)}_{i=1}^N.\n\nWe emphasize that (10a) and (10b) constitute highly tractable linear programs. Moreover, we have Rmin(β̂) ≤ R(β̂) ≤ Rmax(β̂) with probability 1 − 2η.\n\nFigure 1: Out-of-sample performance (solid blue line) and the average CCR (dashed red line); panels: (a) N = 10, (b) N = 100, (c) N = 1000 training samples.\n\n4 Numerical Results\n\nWe now showcase the power of distributionally robust logistic regression in simulated and empirical experiments. All optimization problems are implemented in MATLAB via the modeling language YALMIP [27] and solved with the state-of-the-art nonlinear programming solver IPOPT [28]. All experiments were run on an Intel XEON CPU (3.40GHz). For the largest instance studied (N = 1000), the problems (2), (3), (7) and (10) were solved in 2.1, 4.2, 9.2 and 0.05 seconds, respectively.\n\n4.1 Experiment 1: Out-of-Sample Performance\n\nWe use a simulation experiment to study the out-of-sample performance guarantees offered by distributionally robust logistic regression. As in [8], we assume that the features x ∈ R^10 follow a multivariate standard normal distribution and that the conditional distribution of the labels y ∈ {+1, −1} is of the form (1) with β = (10, 0, . . . , 0). The true distribution P is uniquely determined by this information. If we use the ℓ∞-norm to measure distances in the feature space, then P satisfies the light-tail assumption of Theorem 2 for 2 > a ≳ 1. Finally, we set κ = 1.\n\nOur experiment comprises 100 simulation runs. In each run we generate N ∈ {10, 10², 10³} training samples and 10⁴ test samples from P. 
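The sampling scheme of Experiment 1 can be sketched as follows (a hypothetical reimplementation; the paper's own experiments run in MATLAB with YALMIP and IPOPT):

```python
import numpy as np

def sample_experiment1(N, rng):
    # x ~ N(0, I_10) in R^10 and, per the conditional model (1),
    # Prob(y = +1 | x) = 1 / (1 + exp(-<beta, x>)) with beta = (10, 0, ..., 0)
    beta = np.zeros(10)
    beta[0] = 10.0
    X = rng.standard_normal((N, 10))
    p_plus = 1.0 / (1.0 + np.exp(-(X @ beta)))
    y = np.where(rng.random(N) < p_plus, 1, -1)
    return X, y

rng = np.random.default_rng(1)
X_train, y_train = sample_experiment1(100, rng)
X_test, y_test = sample_experiment1(10_000, rng)
```

Because the first weight is large, the label agrees with the sign of the first feature for the vast majority of samples, which is why high CCR values are attainable in this experiment.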
We calibrate the distributionally robust logistic regression model (6) to the training data and use the test data to evaluate the average logloss as well as the correct classification rate (CCR) of the classifier associated with β̂. We then record the percentage η̂_N(ε) of simulation runs in which the average logloss exceeds Ĵ. Moreover, we calculate the average CCR across all simulation runs. Figure 1 displays both 1 − η̂_N(ε) and the average CCR as a function of ε for different values of N. Note that 1 − η̂_N(ε) quantifies the probability (with respect to the training data) that P belongs to the Wasserstein ball of radius ε around the empirical distribution P̂_N. Thus, 1 − η̂_N(ε) increases with ε. The average CCR benefits from the regularization induced by the distributional robustness and increases with ε as long as the empirical confidence 1 − η̂_N(ε) is smaller than 1. As soon as the Wasserstein ball is large enough to contain the distribution P with high confidence (1 − η̂_N(ε) ≲ 1), however, any further increase of ε is detrimental to the average CCR.\n\nFigure 1 also indicates that the radius ε implied by a fixed empirical confidence level scales inversely with the number of training samples N. Specifically, for N = 10, 10², 10³, the Wasserstein radius implied by the confidence level 1 − η̂ = 95% is given by ε ≈ 0.2, 0.02, 0.003, respectively. This observation is consistent with the a priori estimate (8) of the Wasserstein radius ε_N(η) associated with a given η. Indeed, for a ≳ 1, Theorem 2 implies that ε_N(η) scales with N^{−1/a} ≲ N^{−1} for ε ≥ c_3.\n\n4.2 Experiment 2: The Effect of the Wasserstein Ball\n\nIn the second simulation experiment we study the statistical properties of the out-of-sample logloss. As in [2], we set n = 10 and assume that the features follow a multivariate standard normal distribution, while the conditional distribution of the labels is of the form (1) with β sampled uniformly from the unit sphere. We use the ℓ2-norm in the feature space, and we set κ = 1. All results reported here are averaged over 100 simulation runs. In each trial, we use N = 10² training samples to calibrate problem (6) and 10⁴ test samples to estimate the logloss distribution of the resulting classifier.\n\nFigure 2(a) visualizes the conditional value-at-risk (CVaR) of the out-of-sample logloss distribution for various confidence levels and for different values of ε. The CVaR of the logloss at level α is defined as the conditional expectation of the logloss above its (1 − α)-quantile, see [29]. In other words, the CVaR at level α quantifies the average of the α × 100% worst logloss realizations. As expected, using a distributionally robust approach renders the logistic regression problem more 'risk-averse', which results in uniformly lower CVaR values of the logloss, particularly for smaller confidence levels. Thus, increasing the radius of the Wasserstein ball reduces the right tail of the logloss distribution. 
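The empirical CVaR used in these plots can be estimated directly from logloss samples; a minimal sketch (our own helper, not code from the paper, and one that ignores fractional tail mass for simplicity):

```python
import numpy as np

def cvar(losses, alpha):
    # empirical CVaR at level alpha: the average of the worst
    # alpha*100% loss realizations, i.e. the conditional expectation
    # of the loss above its (1 - alpha)-quantile
    losses = np.sort(np.asarray(losses))
    k = max(1, int(np.ceil(alpha * losses.size)))
    return losses[-k:].mean()

losses = np.array([0.1, 0.2, 0.3, 0.4, 1.0])
print(cvar(losses, 0.4))   # average of the two worst losses → 0.7
print(cvar(losses, 1.0))   # the 100%-CVaR equals the mean loss
```

Note that the 100%-CVaR coincides with the expected loss, matching the observation about the 100%-CVaR curve in Figure 2(b).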
Figure 2(c) confirms this observation by showing that the cumulative distribution function (CDF) of the logloss converges to a step function for large ε. Moreover, one can prove that the weight vector β̂ tends to zero as ε grows. Specifically, for ε ≥ 0.1 we have β̂ ≈ 0, in which case the logloss approximates the deterministic value log(2) ≈ 0.69. Zooming into the CVaR graph of Figure 2(a) at the end of the high confidence levels, we observe that the 100%-CVaR, which coincides in fact with the expected logloss, increases at every quantile level; see Figure 2(b).\n\n4.3 Experiment 3: Real World Case Studies and Risk Estimation\n\nNext, we validate the performance of the proposed distributionally robust logistic regression method on the MNIST dataset [30] and three popular datasets from the UCI repository: Ionosphere, Thoracic Surgery, and Breast Cancer [31]. In this experiment, we use the distance function of Definition 2 with the ℓ1-norm. We examine three different models: logistic regression (LR), regularized logistic regression (RLR), and distributionally robust logistic regression with κ = 1 (DRLR). All results reported here are averaged over 100 independent trials. In each trial related to a UCI dataset, we randomly select 60% of the data to train the models and the rest to test the performance. Similarly, in each trial related to the MNIST dataset, we randomly select 10³ samples from the training dataset and test the performance on the complete test dataset. The results in Table 1 (top) indicate that DRLR outperforms RLR in terms of CCR by about the same amount by which RLR outperforms classical LR (0.3%–1%), consistently across all experiments. We also evaluated the out-of-sample CVaR of the logloss, which is a natural performance indicator for robust methods. 
Table 1 (bottom) shows that DRLR wins by a large margin (outperforming RLR by 4%–43%).\n\nIn the remainder we focus on the Ionosphere case study (the results of which are representative of the other case studies). Figures 3(a) and 3(b) depict the logloss and the CCR for different Wasserstein radii ε. DRLR (κ = 1) outperforms RLR (κ = ∞) consistently for all sufficiently small values of ε. This observation can be explained by the fact that DRLR accounts for uncertainty in the label, whereas RLR does not. Thus, there is a wider range of Wasserstein radii that result in an attractive out-of-sample logloss and CCR. This effect facilitates the choice of ε and could be a significant advantage in situations where it is difficult to determine ε a priori.\n\nFigure 2: CVaR and CDF of the logloss function for different Wasserstein radii ε; panels: (a) CVaR versus quantile of the logloss function, (b) the same, zoomed into the high quantiles, (c) cumulative distribution of the logloss function.\n\nTable 1: The average and standard deviation of CCR and CVaR evaluated on the test dataset (columns: LR, RLR, DRLR).\n\nCCR: Ionosphere: 84.8 ± 4.3%, 86.1 ± 3.1%, 87.0 ± 2.6%; Thoracic Surgery: 82.7 ± 2.0%, 83.1 ± 2.0%, 83.8 ± 2.0%; Breast Cancer: 94.4 ± 1.8%, 95.5 ± 1.2%, 95.8 ± 1.2%; MNIST 1 vs 7: 97.8 ± 0.6%, 98.0 ± 0.3%, 98.6 ± 0.2%; MNIST 4 vs 9: 93.7 ± 1.1%, 94.6 ± 0.5%, 95.1 ± 0.4%; MNIST 5 vs 6: 94.9 ± 1.6%, 95.7 ± 0.5%, 96.7 ± 0.4%.\n\nCVaR: Ionosphere: 10.5 ± 6.9, 4.2 ± 1.5, 3.5 ± 2.0; Thoracic Surgery: 3.0 ± 1.9, 2.3 ± 0.3, 2.2 ± 0.2; Breast Cancer: 20.3 ± 15.1, 1.3 ± 0.4, 0.9 ± 0.2; MNIST 1 vs 7: 3.9 ± 2.8, 0.67 ± 0.13, 0.38 ± 0.06; MNIST 4 vs 9: 8.7 ± 6.5, 1.45 ± 0.20, 1.09 ± 0.08; MNIST 5 vs 6: 14.1 ± 9.5, 1.35 ± 0.20, 0.84 ± 0.08.\n\nFigure 3: Average logloss, CCR and risk for different Wasserstein radii ε (Ionosphere dataset); panels: (a) the average logloss for different κ, (b) the average correct classification rate for different κ, (c) risk estimation and its confidence level.\n\nIn the experiment underlying Figure 3(c), we first fix β̂ to the optimal solution of (7) for ε = 0.003 and κ = 1. Figure 3(c) shows the true risk R(β̂) and its confidence bounds. As expected, for ε = 0 the upper and lower bounds coincide with the empirical risk on the training data, which is a lower bound for the true risk on the test data due to over-fitting effects. As ε increases, the confidence interval between the bounds widens and eventually covers the true risk. For instance, at ε ≈ 0.05 the confidence interval is given by [0, 0.19] and contains the true risk with probability 1 − 2η̂ = 95%.\n\nAcknowledgments: This research was supported by the Swiss National Science Foundation under grant BSCGI0 157733.\n\nReferences\n\n[1] D. W. Hosmer and S. Lemeshow. Applied Logistic Regression. John Wiley & Sons, 2004.\n\n[2] J. Feng, H. Xu, S. Mannor, and S. Yan. Robust logistic regression and classification. In Advances in Neural Information Processing Systems, pages 253–261, 2014.\n\n[3] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482–494, 2013.\n\n[4] N. Ding, S. Vishwanathan, M. Warmuth, and V. S. Denchev. 
t-logistic regression for binary and multiclass classification. The Journal of Machine Learning Research, 5:1–55, 2013.
[5] C. Liu. Robit Regression: A Simple Robust Alternative to Logistic and Probit Regression, pages 227–238. John Wiley & Sons, 2005.
[6] P. J. Rousseeuw and A. Christmann. Robustness against separation and outliers in logistic regression. Computational Statistics & Data Analysis, 43(3):315–332, 2003.
[7] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.
[8] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 78–85, 2004.
[9] K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. The Journal of Machine Learning Research, 8:1519–1555, 2007.
[10] S. Shalev-Shwartz and A. Tewari. Stochastic methods for ℓ1-regularized loss minimization. The Journal of Machine Learning Research, 12:1865–1892, 2011.
[11] J. Shi, W. Yin, S. Osher, and P. Sajda. A fast hybrid algorithm for large-scale ℓ1-regularized logistic regression. The Journal of Machine Learning Research, 11:713–741, 2010.
[12] S. Yun and K.-C. Toh. A coordinate gradient descent method for ℓ1-regularized convex minimization. Computational Optimization and Applications, 48(2):273–307, 2011.
[13] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming. SIAM, 2009.
[14] J. R.
Birge and F. Louveaux. Introduction to Stochastic Programming. Springer, 2011.
[15] A. Ben-Tal and A. Nemirovski. Robust optimization—methodology and applications. Mathematical Programming B, 92(3):453–480, 2002.
[16] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35–53, 2004.
[17] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[18] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
[19] J. Goh and M. Sim. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4):902–917, 2010.
[20] W. Wiesemann, D. Kuhn, and M. Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.
[21] E. Erdoğan and G. Iyengar. Ambiguous chance constrained problems and robust optimization. Mathematical Programming B, 107(1-2):37–61, 2006.
[22] Z. Hu and L. J. Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Technical report, available from Optimization Online, 2013.
[23] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. The Journal of Machine Learning Research, 10:1485–1510, 2009.
[24] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.
[25] P. Mohajerin Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. http://arxiv.org/abs/1505.05116, 2015.
[26] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, pages 1–32, 2014.
[27] J. Löfberg.
YALMIP: A toolbox for modeling and optimization in Matlab. In IEEE International Symposium on Computer Aided Control Systems Design, pages 284–289, 2004.
[28] A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming A, 106(1):25–57, 2006.
[29] R. T. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[30] Y. LeCun. The MNIST database of handwritten digits, 1998. http://yann.lecun.com/exdb/mnist/.
[31] K. Bache and M. Lichman. UCI machine learning repository, 2013. http://archive.ics.uci.edu/ml.