{"title": "A First-Order Algorithmic Framework for Distributionally Robust Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3937, "page_last": 3947, "abstract": "Wasserstein distance-based distributionally robust optimization (DRO) has received much attention lately due to its ability to provide a robustness interpretation of various learning models. Moreover, many of the DRO problems that arise in the learning context admits exact convex reformulations and hence can be tackled by off-the-shelf solvers. Nevertheless, the use of such solvers severely limits the applicability of DRO in large-scale learning problems, as they often rely on general purpose interior-point algorithms. On the other hand, there are very few works that attempt to develop fast iterative methods to solve these DRO problems, which typically possess complicated structures. In this paper, we take a first step towards resolving the above difficulty by developing a first-order algorithmic framework for tackling a class of Wasserstein distance-based distributionally robust logistic regression (DRLR) problem. Specifically, we propose a novel linearized proximal ADMM to solve the DRLR problem, whose objective is convex but consists of a smooth term plus two non-separable non-smooth terms. We prove that our method enjoys a sublinear convergence rate. Furthermore, we conduct three different experiments to show its superb performance on both synthetic and real-world datasets. In particular, our method can achieve the same accuracy up to 800+ times faster than the standard off-the-shelf solver.", "full_text": "A First-Order Algorithmic Framework for Wasserstein\n\nDistributionally Robust Logistic Regression\n\nJiajin Li, Sen Huang, Anthony Man-Cho So\n\nDepartment of Systems Engineering & Engineering Management\n\nThe Chinese University of Hong Kong\n\nShatin, N. 
T., Hong Kong
{jjli,hsen,manchoso}@se.cuhk.edu.hk

Abstract

Wasserstein distance-based distributionally robust optimization (DRO) has received much attention lately due to its ability to provide a robustness interpretation of various learning models. Moreover, many of the DRO problems that arise in the learning context admit exact convex reformulations and hence can be tackled by off-the-shelf solvers. Nevertheless, the use of such solvers severely limits the applicability of DRO in large-scale learning problems, as they often rely on general-purpose interior-point algorithms. On the other hand, there are very few works that attempt to develop fast iterative methods to solve these DRO problems, which typically possess complicated structures. In this paper, we take a first step towards resolving the above difficulty by developing a first-order algorithmic framework for tackling a class of Wasserstein distance-based distributionally robust logistic regression (DRLR) problems. Specifically, we propose a novel linearized proximal ADMM to solve the DRLR problem, whose objective is convex but consists of a smooth term plus two non-separable non-smooth terms. We prove that our method enjoys a sublinear convergence rate. Furthermore, we conduct three different experiments to show its superb performance on both synthetic and real-world datasets. In particular, our method can achieve the same accuracy while being up to 800+ times faster than the standard off-the-shelf solver.

1 Introduction

One of the basic principles for dealing with the overfitting phenomenon in statistical learning is regularization [23]. Recently, there has been a flurry of works that aim to interpret regularization from a distributionally robust optimization (DRO) perspective; see, e.g., [19, 1, 7, 20, 18] and the references therein. 
The results in these works not only provide a probabilistic justification of existing regularization techniques but also offer a powerful alternative approach to tackle risk minimization problems. Indeed, it has been shown that the DRO formulations of various statistical learning problems admit polynomial-time solvable and exact convex reformulations [19, 7, 2, 21, 20], which can be tackled by off-the-shelf solvers (e.g., YALMIP). Nevertheless, the use of such solvers severely limits the applicability of the DRO approach in large-scale learning problems, as they often rely on general-purpose interior-point algorithms. On the other hand, there are very few works that address the design of fast iterative methods for solving the convex reformulations of DRO problems. This is in part due to the complicated structures that are often possessed by such reformulations. In fact, it is only recently that researchers have proposed stochastic gradient descent (SGD) algorithms for DRO with f-divergence-based ambiguity sets [17]. However, f-divergence measures can only compare distributions with the same support, while the Wasserstein distances do not have such a restriction. On another front, the works [15, 11] propose cutting-surface methods to deal with Wasserstein distance-based DRO problems. However, they tend to suffer from a large computational burden.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we take a first step towards bridging the above-mentioned gap by proposing a new first-order algorithmic framework for solving the class of Wasserstein distance-based distributionally robust logistic regression (DRLR) problems considered in [19]. 
The starting point of our investigation is the following reformulation result; see Theorem 1 and Remark 2 in [19]:

$$\inf_{\beta} \sup_{Q \in \mathbb{B}_\epsilon(\hat{\mathbb{P}}_N)} \mathbb{E}_{(x,y)\sim Q}[\ell_\beta(x,y)] = \inf_{\beta,\lambda} \left\{ \lambda\epsilon + \frac{1}{N}\sum_{i=1}^N \big( \ell_\beta(\hat{x}_i,\hat{y}_i) + \max\{\hat{y}_i\beta^T\hat{x}_i - \lambda\kappa,\, 0\} \big) \;:\; \|\beta\|_* \le \lambda \right\}. \quad (1.1)$$

Here, $x \in \mathbb{R}^n$ denotes a feature vector and $y \in \{-1,+1\}$ its associated label to be predicted; $\ell_\beta(x,y) = \log(1+\exp(-y\beta^T x))$ is the log-loss associated with the feature-label pair $(x,y)$ and regression parameter $\beta \in \mathbb{R}^n$; $\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^N$ are $N$ training samples drawn from an unknown underlying distribution $\mathbb{P}_*$ on the feature-label space $\Theta = \mathbb{R}^n \times \{-1,+1\}$; $\hat{\mathbb{P}}_N = \frac{1}{N}\sum_{i=1}^N \delta_{(\hat{x}_i,\hat{y}_i)}$ denotes the empirical distribution associated with the training samples $\{(\hat{x}_i,\hat{y}_i)\}_{i=1}^N$; and $\mathbb{B}_\epsilon(\hat{\mathbb{P}}_N) = \{Q \in \mathcal{P}(\Theta) : W(Q,\hat{\mathbb{P}}_N) \le \epsilon\}$ is the ball in the space $\mathcal{P}(\Theta)$ of probability distributions on $\Theta$ that is centered at the empirical distribution $\hat{\mathbb{P}}_N$ and has radius $\epsilon$ with respect to the Wasserstein distance

$$W(Q,\hat{\mathbb{P}}_N) = \inf_{\Pi \in \mathcal{P}(\Theta\times\Theta)} \left\{ \int_{\Theta\times\Theta} d(\xi,\xi')\,\Pi(d\xi,d\xi') \;:\; \Pi(d\xi,\Theta) = Q(d\xi),\; \Pi(\Theta,d\xi') = \hat{\mathbb{P}}_N(d\xi') \right\},$$

where $\xi = (x,y) \in \Theta$, $d(\xi,\xi') = \|x-x'\| + \frac{\kappa}{2}|y-y'|$ is the transport cost between two data points $\xi,\xi' \in \Theta$ induced by a generic norm $\|\cdot\|$ on $\mathbb{R}^n$ with $\|\cdot\|_*$ being its dual norm, and $\kappa > 0$ is a parameter that represents the 
reliability of the label measurements (the larger the κ, the more reliable the measurements; when κ = ∞, the measurements are error-free). The formulation on the left-hand side of (1.1) is motivated by the desire to construct an ambiguity set around the empirical distribution P̂_N that contains the true distribution P*, so that the resulting classifier has good out-of-sample performance. We refer the reader to [19] for a more detailed discussion.

A natural question that arises from (1.1) is how to solve the convex optimization problem on the right-hand side (RHS) efficiently. When κ = ∞, the RHS of (1.1) reduces to a classic regularized logistic regression problem [19, Remark 1]. As such, a host of practically efficient first-order methods (such as proximal gradient-type methods or stochastic (variance-reduced) gradient methods) with provable convergence guarantees (see, e.g., [22, 24, 28, 2]) can be applied. However, the algorithmic aspects of the practically more relevant case where κ < ∞ have not been well explored. Our proposed framework for tackling this case consists of two steps. First, by considering the optimality conditions of the RHS of (1.1), we can derive an upper bound λ_U on the optimal λ*. This suggests that we can first initialize λ to a value in [0, λ_U] and solve the resulting problem that involves only the variable β (the β-subproblem), then apply golden-section search to update λ, and then repeat the whole process until we find the optimal solution to (1.1). Second, which is the core step of our framework, is to design a fast iterative method for solving the β-subproblem. 
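The outer loop just described, which fixes λ, solves the β-subproblem, and updates λ by golden-section search over [0, λ_U], can be sketched in a few lines. This is a minimal illustration, not the paper's Appendix B pseudo-code: `q` stands for any unimodal function of λ (in the framework it would be evaluated by running a β-subproblem solver), and all names are hypothetical.

```python
import math

def golden_section_search(q, lo, hi, tol=1e-6):
    """Minimize a unimodal function q over [lo, hi] by golden-section search."""
    invphi = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618
    a, b = lo, hi
    c = b - invphi * (b - a)
    d = a + invphi * (b - a)
    qc, qd = q(c), q(d)
    while b - a > tol:
        if qc < qd:           # minimizer lies in [a, d]
            b, d, qd = d, c, qc
            c = b - invphi * (b - a)
            qc = q(c)
        else:                 # minimizer lies in [c, b]
            a, c, qc = c, d, qd
            d = a + invphi * (b - a)
            qd = q(d)
    return (a + b) / 2
```

Each iteration shrinks the bracket by a factor of about 0.618 while reusing one of the two interior function evaluations, so only one (expensive) β-subproblem solve is needed per outer step.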
By treating λ as a constant, the RHS of (1.1) is equivalent to

$$\inf_{\|\beta\|_* \le \lambda} \; \frac{1}{N}\sum_{i=1}^N \big( h(\hat{y}_i\beta^T\hat{x}_i) + \max\{\hat{y}_i\beta^T\hat{x}_i - \lambda\kappa,\, 0\} \big) \quad (1.2)$$

with $h(u) = \log(1+\exp(-u))$. Although (1.2) has a relatively simple norm-ball constraint, its objective is non-smooth and non-separable. As such, most existing first-order methods (e.g., projected/proximal subgradient methods) are ill-suited for tackling it. To proceed, we apply the operator splitting technique to reformulate (1.2) as

$$\inf_{\beta,\mu} \; \frac{1}{N}\sum_{i=1}^N \big( h(\mu_i) + \max\{\mu_i - \lambda\kappa,\, 0\} \big) \quad \text{s.t.} \quad Z\beta - \mu = 0,\; \|\beta\|_* \le \lambda, \quad (1.3)$$

where $Z$ is the $N \times n$ matrix whose $i$-th row is $\hat{y}_i\hat{x}_i^T$, and propose a new linearized proximal alternating direction method of multipliers (LP-ADMM) to fully exploit the structure of (1.3). In particular, our method differs substantially from the commonly used ADMM variants in the literature [27, 14, 6, 12, 8, 9, 25] in the updates of the variables. For the β-update, we solve a norm-constrained quadratic optimization problem. Since such a problem can be rather ill-conditioned, we provide three different types of solvers to handle this task, namely, the accelerated projected gradient descent, coordinate minimization [10], and active set conjugate gradient [5] methods. For the µ-update, observing that the coupling matrix for µ in the linear equality constraint is the identity, the augmented Lagrangian function is already locally strongly convex in µ. 
Hence, instead of using a quadratic approximation of h(·) as in the vanilla proximal ADMM, we use a first-order approximation without step size selection; i.e.,

$$\mu^{k+1} = \arg\min_{\mu} \left\{ \frac{1}{N}\sum_{i=1}^N \big( h'(\mu_i^k)\mu_i + \max\{\mu_i - \lambda\kappa,\, 0\} \big) - (w^k)^T(Z\beta^{k+1} - \mu) + \frac{\rho}{2}\|\mu - Z\beta^{k+1}\|_2^2 \right\},$$

where $w \in \mathbb{R}^N$ is the dual variable associated with the linear equality constraint in (1.3) and $\rho > 0$ is the penalty parameter in the augmented Lagrangian function. On the theoretical side, we prove that our proposed LP-ADMM enjoys an O(1/K) convergence rate under standard assumptions. On the numerical side, we demonstrate via extensive experiments that our proposed method can be sped up substantially by adopting a geometrically increasing step size strategy. In particular, our method can achieve a hundred-fold speedup over the standard solver (which is the only other method that has been used so far to solve (1.1)) on both synthetic and real-world datasets, without the need to tune an optimal penalty parameter in every iteration. To the best of our knowledge, our work is the first to propose a first-order algorithmic framework for solving the Wasserstein distance-based DRLR problem (1.1) for any κ > 0. Moreover, the proposed framework is sufficiently general that it can potentially be applied to other DRO problems, which could be of independent interest.

2 Preliminaries

Let us introduce some basic definitions and concepts. 
To allow for greater generality, consider the following problem:

$$\underset{x,y}{\text{minimize}} \quad F(x,y) = f(y) + P(y) + g(x) \qquad \text{subject to} \quad Ax - y = 0. \quad (2.1)$$

Here, $f: \mathbb{R}^N \to \mathbb{R}$ is a closed convex function that is continuously differentiable on int(dom(f)); $A \in \mathbb{R}^{N\times n}$ is a linear operator; $P: \mathbb{R}^N \to \mathbb{R}\cup\{+\infty\}$ is a closed proper convex function; and $g(x)$ is the indicator function of a norm ball. It should be clear that problem (2.1) includes the β-subproblem (1.3) as a special case. Indeed, the latter can be written as

$$\underset{\mu,\beta}{\text{minimize}} \quad F(\mu,\beta) = f(\mu) + P(\mu) + g(\beta) \qquad \text{subject to} \quad Z\beta - \mu = 0,$$

where $f(\mu) = \frac{1}{N}\sum_{i=1}^N \big\{\log(1+\exp(-\mu_i)) + \frac{1}{2}(\mu_i - \lambda\kappa)\big\}$, $P(\mu) = \frac{1}{2N}\sum_{i=1}^N |\mu_i - \lambda\kappa|$, and $g(\beta) = I_{\{\|\beta\|_* \le \lambda\}}$. Now, the augmented Lagrangian function associated with (2.1) is given by

$$\mathcal{L}_\rho(x,y;w) = f(y) + P(y) + g(x) - w^T(Ax-y) + \frac{\rho}{2}\|Ax-y\|_2^2, \quad (2.2)$$

where $w$ is the multiplier. We use $(\mathcal{X}^*,\mathcal{Y}^*)$ to denote the solution set of (2.1). A point $(x^*, y^*)$ is optimal for (2.1) if there exists a $w^*$ such that the following KKT conditions are satisfied:

$$A^T w^* \in \partial g(x^*), \qquad -w^* \in \nabla f(y^*) + \partial P(y^*), \qquad Ax^* - y^* = 0. \quad (2.3)$$

Assumption 2.1. There exists a point $(x^*, y^*, w^*)$ satisfying the KKT conditions in (2.3).

Assumption 2.2. The gradient of the function $f$ is Lipschitz continuous; i.e., there exists a constant $L_f > 0$ such that $\|\nabla f(x) - \nabla f(y)\| \le L_f \|x-y\|$ for all $x, y$.

Definition 2.3 (Bregman Divergence). Let $f: \Omega \to \mathbb{R}$ be a function that is (a) strictly convex, (b) continuously differentiable, and (c) defined on a closed convex set $\Omega$. 
The Bregman divergence with respect to $f$ is defined as

$$B_f(x,y) = f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle.$$

3 First-Order Algorithmic Framework

In this section, we present our first-order algorithmic framework for solving the DRLR problem. For concreteness' sake, we take ‖·‖ in the transport cost to be the ℓ1-norm in this paper. However, it should be mentioned that our framework is general enough to handle other norms as well.

[Figure 1: First-order algorithmic framework for Wasserstein DRLR with ℓ1-induced transport cost. The diagram chains the DRLR problem to its exact reformulation (A), the β-subproblem (B) obtained by fixing λ, and the exact β-update (C) inside LP-ADMM:
(A) $\min_{\beta,s,\lambda} \; \lambda\epsilon + \frac{1}{N}\sum_{i=1}^N s_i$ s.t. $\ell_\beta(\hat{x}_i,\hat{y}_i) \le s_i$, $\ell_\beta(\hat{x}_i,-\hat{y}_i) - \lambda\kappa \le s_i$, $i \in [N]$, $\|\beta\|_\infty \le \lambda$;
fixing λ gives $\min_{\beta} \; \frac{1}{N}\sum_{i=1}^N \big(h(\hat{y}_i\beta^T\hat{x}_i) + \max\{\hat{y}_i\beta^T\hat{x}_i - \lambda\kappa, 0\}\big)$ s.t. $\|\beta\|_\infty \le \lambda$, split into
(B) $\min_{\beta,\mu} \; \frac{1}{N}\sum_{i=1}^N \big(h(\mu_i) + \max\{\mu_i - \lambda\kappa, 0\}\big)$ s.t. $Z\beta - \mu = 0$, $\|\beta\|_\infty \le \lambda$, whose exact β-update is
(C) $\min_{\beta} \; \big\|Z\beta - \mu^k - \frac{w^k}{\rho}\big\|_2^2$ s.t. $\|\beta\|_\infty \le \lambda$.]

We summarize the key components of our first-order algorithmic framework in Figure 1. As shown in [19], the original DRLR problem (i.e., the LHS of (1.1)) can be reformulated as the convex program (A) using strong duality. 
A standard approach to tackling problem (A) is to use an off-the-shelf solver (e.g., YALMIP). To develop an efficient algorithmic framework, we focus on the RHS of (1.1) and proceed in two steps. Motivated by the structure of the RHS of (1.1), a natural first step is to fix λ at a certain value to obtain the problem (1.2), which involves only the variable β and will be referred to as the β-subproblem in the sequel. The second, which is also the core step of our framework, is to design a fast iterative algorithm to tackle the β-subproblem (1.2). The main difficulty of problem (1.2) comes from the two non-smooth non-separable terms. To overcome this difficulty, we introduce the auxiliary variable µ_i = ŷ_i β^T x̂_i to split the non-separable non-smooth term max{ŷ_i β^T x̂_i − λκ, 0}, thus leading to problem (B). Then, we propose a novel linearized proximal ADMM (LP-ADMM) algorithm to solve it efficiently. As will be shown in Section 4, the proposed LP-ADMM converges at the rate O(1/K) when applied to the β-subproblem (B). In each iteration of our LP-ADMM algorithm, we perform an exact minimization for the β-update, which entails solving the box-constrained quadratic optimization problem (C) (here, w^k denotes the corresponding Lagrange multiplier). Towards that end, we provide three alternative solvers for problem (C), which target three different settings. Specifically, we use accelerated projected gradient descent in the well-conditioned case, coordinate minimization [10] in the high-dimensional case N ≪ d, and the active set conjugate gradient method [5] in the ill-conditioned case. The details are given in Appendix B.

To implement the above framework, let us first show that there is a finite upper bound λ_U on the optimal λ* for problem (A). 
Observe that the objective function in the RHS of (1.1) takes the form

$$\Omega(\lambda,\beta) = \lambda\epsilon + \frac{1}{N}\sum_{i=1}^N \big( h(\hat{y}_i\beta^T\hat{x}_i) + \max\{\hat{y}_i\beta^T\hat{x}_i - \lambda\kappa,\, 0\} \big) + I_{\{\|\beta\|_\infty \le \lambda\}}.$$

Now, let $q(\lambda) = \inf_\beta \Omega(\lambda,\beta)$. As the function $\Omega(\cdot,\cdot)$ is jointly convex, we can conclude that $q(\cdot)$ is a convex (and hence unimodal) function on $\mathbb{R}$. Furthermore, the DRLR problem (A) satisfies the Mangasarian-Fromovitz constraint qualification (MFCQ), which implies that its KKT conditions are necessary and sufficient for optimality. As the following proposition shows, we can use the KKT system of problem (A) to derive the desired upper bound on λ*:

Proposition 3.1. Suppose that $(\beta^*, \lambda^*, s^*)$ is an optimal solution to problem (A) in Figure 1. Then, we have $\lambda^* \le \lambda_U = \frac{0.2785}{\epsilon}$.

Proof. Using $a_{ij}$, where $i = 1,\ldots,N$ and $j = 1,\ldots,4$, to denote the multipliers associated with the constraints in problem (A), i.e.,

$$\min_{\beta,s,\lambda} \; \lambda\epsilon + \frac{1}{N}\sum_{i=1}^N s_i \quad \text{s.t.} \quad \ell_\beta(\hat{x}_i,\hat{y}_i) \le s_i,\; \ell_\beta(\hat{x}_i,-\hat{y}_i) - \lambda\kappa \le s_i,\; i \in [N], \qquad e_i^T\beta \le \lambda,\; -e_i^T\beta \le \lambda,\; i \in [n],$$

we can write down the KKT conditions of problem (A) as follows:

$$\begin{cases} \displaystyle\sum_{i=1}^N \big( a_{i1}\nabla_\beta\ell_\beta(\hat{x}_i,\hat{y}_i) + a_{i2}\nabla_\beta\ell_\beta(\hat{x}_i,-\hat{y}_i) \big) + \sum_{i=1}^n (a_{i3} - a_{i4})e_i = 0, \\ \displaystyle\kappa\sum_{i=1}^N a_{i2} + \sum_{i=1}^n (a_{i3} + a_{i4}) = \epsilon, \\ a_{i1} + a_{i2} = \frac{1}{N}, \; i \in [N], \\ a_{i1}(\ell_\beta(\hat{x}_i,\hat{y}_i) - s_i) = 0, \; i \in [N], \\ a_{i2}(\ell_\beta(\hat{x}_i,-\hat{y}_i) - \lambda\kappa - s_i) = 0, \; i \in [N], \\ a_{i3}(e_i^T\beta - \lambda) = 0, \quad a_{i4}(e_i^T\beta + \lambda) = 0, \; i \in [n], \\ a_{ij} \ge 0. \end{cases}$$

After some elementary manipulations (see Appendix A for details), we obtain

$$\lambda \le \frac{1}{N\epsilon}\sum_{i=1}^N \frac{\hat{y}_i\beta^T\hat{x}_i \exp(-\hat{y}_i\beta^T\hat{x}_i)}{1+\exp(-\hat{y}_i\beta^T\hat{x}_i)} \le \frac{0.2785}{\epsilon},$$

as desired; the last inequality holds because $u \mapsto \frac{u e^{-u}}{1+e^{-u}} = \frac{u}{1+e^{u}}$ attains its maximum value of approximately 0.2785 over $\mathbb{R}$.

Remark 3.2. Although Proposition 3.1 applies to the case where the transport cost is induced by the ℓ1-norm, the techniques used to prove it also carry over to the ℓ2 and ℓ∞ cases. All we need to do is to modify the norm-constraint parts above (highlighted in blue in the original paper). 
Indeed, when the transport cost is induced by the ℓ2-norm, the norm constraint in problem (A) becomes ‖β‖₂ ≤ λ, which is equivalent to ‖β‖₂² ≤ λ². On the other hand, when the transport cost is induced by the ℓ∞-norm, the norm constraint becomes ‖β‖₁ ≤ λ, which can be expressed as Bβ ≤ λe with B being the 2^n × n matrix whose rows are all the possible arrangements of +1's and −1's and e the all-one vector.

Proposition 3.1, together with the unimodality of q(·), suggests the following natural strategy for finding an optimal solution (β*, λ*, s*) to problem (A): initialize λ in (A) to a value in [0, λ_U], solve the resulting β-subproblem (B), apply golden-section search to update λ, and repeat. The pseudo-code for the golden-section search on λ can be found in Appendix B. The β-subproblem (B) will be solved by our proposed LP-ADMM, which we present next.

4 LP-ADMM for the β-Subproblem and Its Convergence Analysis

To simplify notation, we consider the prototypical form (2.1) of the β-subproblem here. It can be shown that the β-subproblem (B) satisfies Assumptions 2.1 and 2.2 in Section 2. Now, we present our proposed LP-ADMM in Algorithm 1.

The x-update is standard in ADMM-type algorithms and leads to a box-constrained quadratic optimization problem. The crux of our algorithm lies in the local model used to perform the y-update. To understand the local model, observe that since the coupling matrix for y in the constraint Ax − y = 0 is the identity, the augmented Lagrangian function L_ρ(·,·;·) in (2.2) is strongly convex in y. 
Thus, instead of using the quadratic approximation of f(·) as in the vanilla proximal ADMM, we can use the first-order approximation $y \mapsto \hat{f}(y;y^k) = f(y^k) + \nabla f(y^k)^T(y-y^k)$ at the current iterate $y^k$. This leads to the y-update

$$y^{k+1} = \arg\min_y \left\{ \hat{f}(y;y^k) - \langle w^k,\, Ax^{k+1} - y \rangle + \frac{\rho}{2}\|y - Ax^{k+1}\|_2^2 + P(y) \right\},$$

which, as can easily be verified, is equivalent to the update given in Algorithm 1. In fact, using such a first-order local model not only makes the resulting algorithm converge faster in practice but also eliminates the need to perform step size selection. The latter makes our algorithmic framework numerically more robust in general.

Algorithm 1: Linearized Proximal ADMM (LP-ADMM) for Solving (2.1)
Input: initial point $(x^0, y^0, w^0) \in \mathbb{R}^n \times \mathbb{R}^N \times \mathbb{R}^N$; number of iterations K; initial penalty parameter $\rho_0$; growth parameter $\gamma \ge 1$.
Output: $\{(x^k, y^k, w^k)\}_{k=1}^K$ and $\{F(x^k, y^k)\}_{k=1}^K$.
for each iteration do
  $x^{k+1} = \arg\min_{x\in\mathbb{R}^n} \left\{ \frac{\rho_k}{2}\left\|Ax - y^k - \frac{w^k}{\rho_k}\right\|_2^2 + g(x) \right\}$;
  $y^{k+1} = \arg\min_{y\in\mathbb{R}^N} \left\{ \frac{\rho_k}{2}\left\|y - \left(Ax^{k+1} - \frac{w^k + \nabla f(y^k)}{\rho_k}\right)\right\|_2^2 + P(y) \right\}$;
  $w^{k+1} = w^k - \rho_k(Ax^{k+1} - y^{k+1})$;
  $\rho_{k+1} = \gamma\rho_k$ (in particular, if $\gamma = 1$, then $\rho_k = \rho_{k+1} = \rho$);
end

Next, let us analyze the convergence behavior of the LP-ADMM. 
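Before turning to the analysis, the updates of Algorithm 1 can be made concrete for the β-subproblem (1.3). The following is a minimal sketch, not the paper's implementation: the β-update is approximated by a few projected gradient steps on (C) (the paper uses exact minimization via accelerated projected gradient, coordinate minimization, or active set conjugate gradient), the µ-update uses the first-order model of f together with the exact prox of P (a soft-thresholding step around λκ), and the problem sizes and parameters below are illustrative.

```python
import numpy as np

def lp_admm(Z, lam, kappa, rho=1.0, gamma=1.05, iters=300):
    """Sketch of LP-ADMM for (1.3):
    min_{beta,mu} (1/N) sum_i [h(mu_i) + max(mu_i - lam*kappa, 0)]
    s.t. Z beta - mu = 0, ||beta||_inf <= lam, with h(u) = log(1 + exp(-u))."""
    N, n = Z.shape
    beta, mu, w = np.zeros(n), np.zeros(N), np.zeros(N)
    L = np.linalg.norm(Z, 2) ** 2          # Lipschitz constant of the quadratic in (C)
    for _ in range(iters):
        # beta-update: projected gradient on the box-constrained quadratic (C)
        t = mu + w / rho
        for _ in range(50):
            grad = Z.T @ (Z @ beta - t)
            beta = np.clip(beta - grad / L, -lam, lam)
        # mu-update: first-order model of f, exact prox of P
        # f(mu) = (1/N) sum [log(1+exp(-mu_i)) + 0.5*(mu_i - lam*kappa)]
        # P(mu) = (1/(2N)) ||mu - lam*kappa||_1
        grad_f = (0.5 - 1.0 / (1.0 + np.exp(mu))) / N
        v = Z @ beta - (w + grad_f) / rho
        shift = v - lam * kappa
        mu = lam * kappa + np.sign(shift) * np.maximum(
            np.abs(shift) - 1.0 / (2 * N * rho), 0.0)
        # dual update and geometrically increasing penalty
        w = w - rho * (Z @ beta - mu)
        rho *= gamma
    obj = np.mean(np.log1p(np.exp(-Z @ beta))
                  + np.maximum(Z @ beta - lam * kappa, 0.0))
    return beta, mu, obj
```

Note how the µ-update needs no step size: the identity coupling makes the proximal term alone strongly convex, so the linearized model plus the prox of P has a closed-form minimizer.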
Based on the definition of the augmented Lagrangian function in (2.2), the optimality conditions of the subproblems in Algorithm 1 can be written as follows:

$$0 \in \rho A^T\left(Ax^{k+1} - y^k - \frac{w^k}{\rho}\right) + \partial g(x^{k+1}), \quad (4.1)$$

$$0 \in \nabla f(y^k) + \rho\left(y^{k+1} - Ax^{k+1} + \frac{w^k}{\rho}\right) + \partial P(y^{k+1}). \quad (4.2)$$

Using (4.1) and (4.2), we can establish the following basic properties concerning the iterates of our proposed LP-ADMM. The proofs can be found in Appendix A.

Proposition 4.1. Suppose that we use a constant penalty parameter $\rho$ that satisfies $\rho > (\sqrt{3}+1)L_f$. Let $\{(x^k, y^k, w^k)\}_{k\ge 0}$ be the sequence generated by the LP-ADMM and $(x^*, y^*, w^*)$ be a point satisfying the KKT conditions (2.3) with $x^* \in \mathcal{X}$, $y^* \in \mathcal{Y}$. Then, the following hold:

(a) For all $k \ge 1$, $\|Ax^{k+1} - y^k\|_2^2 \ge \frac{1}{2}\|y^{k+1} - y^k\|_2^2 - \frac{L_f^2}{\rho^2}\|y^k - y^{k-1}\|_2^2$.

(b) For all $k \ge 0$ and $(x,y)$ satisfying $Ax - y = 0$, we have $F(x^{k+1}, y^{k+1}) - F(x,y) \le \frac{1}{2\rho}(\|w^k\|_2^2 - \|w^{k+1}\|_2^2) + \frac{\rho}{2}(\|y^k - y\|_2^2 - \|y^{k+1} - y\|_2^2) + c(\|y^k - y^{k-1}\|_2^2 - \|y^{k+1} - y^k\|_2^2) + (B_f(y, y^{k+1}) - B_f(y, y^k))$, where $c = \frac{\rho - 2L_f}{4}$.

(c) The sequence $\left\{\frac{1}{2\rho}\|w^k - w^*\|_2^2 + \frac{\rho}{2}\|y^k - y^*\|_2^2 - B_f(y^*, y^k)\right\}_{k\ge 0}$ is non-increasing and bounded below.

Armed with Proposition 4.1, we can prove the main convergence theorem for the LP-ADMM.

Theorem 4.2. Consider the setting of Proposition 4.1. Set $\bar{x}_K = \frac{1}{K}\sum_{k=1}^K x^k$ and $\bar{y}_K = \frac{1}{K}\sum_{k=1}^K y^k$. 
Then, the following hold:

(a) The sequence $\{(x^k, y^k, w^k)\}_{k\ge 0}$ converges to a KKT point of problem (2.1).

(b) The sequence of function values converges at the rate O(1/K):

$$F(\bar{x}_K, \bar{y}_K) - F(x^*, y^*) \le \frac{(1/2\rho)\|w^0\|_2^2 + (\rho/2)\|y^* - y^0\|_2^2 + c\|y^0 - y^1\|_2^2}{K} = O\left(\frac{1}{K}\right).$$

Remark 4.3. The standard linearized ADMM in [14, 26, 25] involves the quadratic term $\frac{\eta}{2}\|y - y^k\|^2$, where $\eta$ needs to satisfy $\eta > L_f$. Our LP-ADMM can be regarded as a linearized ADMM with $\eta = 0$. Using the first-order local model, the LP-ADMM achieves the fastest single-step update. Moreover, it is worth noting that the adaptive penalty strategy works well in practice, especially the geometrically increasing one (i.e., the penalty update step $\rho_{k+1} = \gamma\rho_k$ in Algorithm 1).

5 Experiment Results

In this section, we present numerical results to demonstrate the effectiveness and efficiency of the different components of our proposed algorithmic framework. All experiments were conducted using MATLAB R2018a on a computer running Windows 10 with an Intel® Core™ i5-8600 CPU (3.10 GHz) and 16 GB RAM. We conducted three different experiments to validate our theoretical results and show the high efficiency of our implementation of the proposed first-order algorithmic framework. To begin, we compare the CPU time of our framework with the YALMIP solver used by [19] on both synthetic and real datasets. Then, we present an empirical comparison of our LP-ADMM with other baseline first-order algorithms, including the Projected SubGradient Method (SubGradient), Primal-Dual Hybrid Gradient (PDHG), Linearized ADMM, and Standard ADMM, on the β-subproblem. Lastly, we show the test data performance of the DRLR model on real datasets. 
We use the active set conjugate gradient method to solve the box-constrained quadratic optimization problem (C) in this section. Our code is available at https://github.com/gerrili1996/DRLR_NIPS2019_exp.

5.1 CPU Time Comparison with the YALMIP Solver

Our setup for the synthetic experiments is as follows. We first generate β from the standard n-dimensional Gaussian distribution N(0, I_n) and normalize it to obtain the ground truth β* = β/‖β‖. Next, we generate the feature vectors $\{\hat{x}_i\}_{i=1}^N$ independently and identically distributed (i.i.d.) from N(0, I_n) and the noisy measurements $\{z_i\}_{i=1}^N$ i.i.d. from the uniform distribution over [0, 1]. Lastly, we compute the ground truth labels $\{\hat{y}_i\}_{i=1}^N$ via $\hat{y}_i = 2\cdot\mathbb{1}\big(z_i < \frac{1}{1+\exp(-\beta_*^T\hat{x}_i)}\big) - 1$. We set the DRLR model parameters to κ = 1, ε = 0.1 and the default parameters of our Adaptive LP-ADMM to ρ₀ = 0.001, γ = 1.05. All the experiment results reported here were averaged over 30 independent trials with random seeds. Table 1 summarizes the comparison of CPU times at different scales in the synthetic setting. Our experiment results indicate that the proposed LP-ADMM with the adaptive penalty strategy can be over 800 times faster than YALMIP, a state-of-the-art optimization solver, and that the performance gap grows considerably with the problem size.

Table 1: CPU time comparison: LP-ADMM vs. 
YALMIP (used in [19]) in the synthetic setting

(N, d)          YALMIP (s)         Non-Adaptive (s)    Adaptive (s)    Ratio
(10, 3)         2.40 ± 0.18        0.06 ± 0.02         0.07 ± 0.02     37
(100, 3)        3.29 ± 0.05        0.14 ± 0.03         0.06 ± 0.01     54
(100, 10)       3.34 ± 0.03        0.21 ± 0.03         0.08 ± 0.01     44
(500, 10)       7.92 ± 0.17        0.58 ± 0.16         0.14 ± 0.01     55
(500, 50)       8.53 ± 0.17        0.60 ± 0.03         0.24 ± 0.01     36
(1000, 50)      16.44 ± 0.44       0.96 ± 0.07         0.25 ± 0.02     67
(1000, 100)     19.16 ± 0.48       1.69 ± 0.11         0.38 ± 0.01     51
(3000, 50)      65.87 ± 1.54       2.40 ± 0.15         0.32 ± 0.01     206
(3000, 100)     113.94 ± 2.05      3.84 ± 0.20         0.47 ± 0.04     243
(5000, 100)     287.67 ± 2.67      7.08 ± 0.66         0.64 ± 0.03     451
(10000, 10)     283.25 ± 18.98     5.03 ± 0.76         0.50 ± 0.02     563
(10000, 100)    1165.40 ± 26.52    19.75 ± 3.74        1.37 ± 0.12     852

We also tested our proposed method on the real datasets a1a-a9a downloaded from LIBSVM¹. Note that the data matrices from these datasets are ill-conditioned and highly sparse, which should be contrasted with the well-conditioned and dense ones in the synthetic setting. Table 2 shows the comparison of CPU times on the real datasets. We observe that our methods work exceptionally well, especially in the large-scale case (i.e., a9a).

As standard ADMM-type algorithms are very sensitive to the choice of penalty parameters, it is hard to tune the optimal penalty parameter for each subproblem with a different λ. Thus, we use a constant penalty parameter in the non-adaptive case. In fact, since we do not need to perform a careful penalty parameter selection in our method, we can achieve an even greater speedup by using an adaptive penalty strategy. 
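For reference, the synthetic data generation of Section 5.1 can be reproduced with a short script. This is a sketch in Python rather than the paper's MATLAB code, and the function name and seed handling are illustrative:

```python
import numpy as np

def make_synthetic(N, n, seed=0):
    """Synthetic DRLR data as described in Section 5.1: normalized Gaussian
    ground truth beta*, i.i.d. Gaussian features, and labels drawn by
    comparing uniform noise against the logistic probability."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(n)
    beta_star = beta / np.linalg.norm(beta)      # ground truth on the unit sphere
    X = rng.standard_normal((N, n))              # features x_i ~ N(0, I_n)
    z = rng.uniform(0.0, 1.0, size=N)            # noisy measurements z_i ~ U[0, 1]
    p = 1.0 / (1.0 + np.exp(-X @ beta_star))     # logistic probabilities
    y = 2 * (z < p).astype(int) - 1              # labels y_i in {-1, +1}
    Z = y[:, None] * X                           # matrix Z with i-th row y_i * x_i^T
    return X, y, Z, beta_star
```

The returned matrix Z is exactly the data matrix entering the β-subproblem (B), so the output can be fed directly to any solver for (1.3).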
Moreover, it is worth noting that our approaches achieve higher-accuracy solutions than the YALMIP solver in all the experiments.

^1 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Table 2: CPU time comparison: LP-ADMM vs. YALMIP (used in [19]) on UCI adult datasets

Dataset   Samples   Features   YALMIP (s)   Ours (s)   Ratio
a1a       1605      123        25.63        2.93       9
a2a       2265      123        39.20        3.53       11
a3a       3185      123        57.79        4.26       14
a4a       4781      123        105.32       4.56       23
a5a       6414      123        155.42       4.39       35
a6a       11220     123        413.65       4.68       88
a7a       16100     123        738.12       5.41       137
a8a       22696     123        1396.45      5.81       240
a9a       32561     123        2993.30      7.08       423

Figure 2: CPU time comparison with YALMIP using the interior-point algorithm

5.2 Efficiency of LP-ADMM for the β-Subproblem

To further demonstrate the efficacy of our proposed LP-ADMM on the β-subproblem, we present an empirical comparison of our algorithm with other first-order methods in the synthetic setting. The implementation details are as follows: λ is regarded as a constant (i.e., λ = 0.1), the DRLR model parameters are the same as in Section 5.1, and the first-order methods used include

(a) Two-block Standard ADMM (SADMM) [3]: for both the β- and μ-updates, we perform exact minimization, carried out using accelerated projected gradient descent and the semi-smooth Newton method [13], respectively (pseudo-codes are given in Appendix B);
(b) Primal-Dual Hybrid Gradient (PDHG) [4];
(c) Linearized ADMM (LADMM): all the ingredients are the same as in LP-ADMM, except that the μ-update involves the classic quadratic term;
(d) Projected Subgradient Method (SubGradient).

The convergence curves for various synthetic cases are shown in Figure 3. The performance of our methods significantly dominates that of the other methods, which agrees with our theoretical findings in Section 4.
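For reference, baseline (d) admits a very short generic implementation. The sketch below uses the classic diminishing step size t_k = c/√(k+1); the function names and the step-size rule are our own illustrative choices, not details taken from the paper.

```python
import numpy as np

def projected_subgradient(subgrad, project, x0, steps=1000, c=1.0):
    """Generic projected subgradient method: x_{k+1} = Proj(x_k - t_k * g_k),
    with t_k = c / sqrt(k + 1), yielding the usual O(1/sqrt(k)) rate."""
    x = np.asarray(x0, dtype=float)
    for k in range(steps):
        g = subgrad(x)                              # any subgradient at x
        x = project(x - c / np.sqrt(k + 1) * g)     # step, then project back
    return x

# Toy usage: minimize |x - 2| over the box [0, 1]; the minimizer is x = 1.
sol = projected_subgradient(
    subgrad=lambda x: np.sign(x - 2.0),
    project=lambda x: np.clip(x, 0.0, 1.0),
    x0=np.zeros(1),
)
```

Its slow sublinear rate is consistent with SubGradient lagging behind the ADMM-type methods in Figure 3.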
Compared with LADMM, this demonstrates the practical advantage of the first-order local model for the μ-update. In addition, LP-ADMM and Adaptive LP-ADMM have similar performance on small instances, but the latter performs better on large instances. In summary, we have demonstrated the usefulness and efficiency of all the components in our first-order algorithmic framework. In particular, as the data matrices in the real datasets are ill-conditioned, none of the baseline approaches can achieve high accuracy, but our proposed LP-ADMM can.

5.3 Test Data Performance of the DRLR Model

In this subsection, we compare the test data performance of the DRLR model with two classic models, namely, Logistic Regression (LR) and Regularized Logistic Regression (RLR). The latter refers to

min_β (1/N) Σ_{i=1}^N ℓ_β(x̂_i, ŷ_i) + ε‖β‖_∞.

If the training data labels are error-free (which corresponds to κ = ∞), then the DRLR model reduces to RLR [19].

Figure 3: Comparison of LP-ADMM with other first-order methods on the β-subproblem: the y-axis is the sub-optimality gap log(F^k − F^*); "total iterations" refers to that taken by Adaptive ADMM

We use grid search with cross-validation to select the parameters of the DRLR model (i.e., ε = 0.3, κ = 7). In addition, we randomly select 60% of the data to train the models and use the rest to test the performance. As before, all the results reported here were averaged over 30 independent trials.
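For concreteness, the RLR objective above can be evaluated directly once ℓ_β is taken to be the standard logistic loss ℓ_β(x̂, ŷ) = log(1 + exp(−ŷ β^T x̂)), as in [19]; the sketch and function name below are ours.

```python
import numpy as np

def rlr_objective(beta, X, y, eps):
    """RLR objective from Section 5.3:
    (1/N) * sum_i log(1 + exp(-y_i * beta^T x_i)) + eps * ||beta||_inf."""
    margins = y * (X @ beta)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way
    loss = np.mean(np.logaddexp(0.0, -margins))
    return loss + eps * np.max(np.abs(beta))
```

At β = 0 each logistic term equals log 2 and the ℓ∞ penalty vanishes, which is a quick sanity check on any implementation.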
Table 3 shows the average classification accuracy on the test data. We observe that the DRLR model consistently outperforms the two classic models on all datasets. Thus, the distributionally robust optimization approach offers a promising way to ameliorate poor test performance in practice.

Table 3: Average classification accuracy on test datasets

Dataset          LR        RLR (κ = ∞)   DRLR (κ = 7)
a1a              83.13%    83.82%        84.01%
a2a              83.68%    83.93%        84.24%
MNIST (0 vs 3)   99.15%    99.45%        99.55%
MNIST (0 vs 4)   99.39%    99.54%        99.75%
MNIST (0 vs 6)   97.88%    98.86%        98.92%
MNIST (2 vs 3)   96.87%    97.31%        97.40%
MNIST (2 vs 5)   96.67%    97.45%        97.77%
MNIST (5 vs 8)   94.80%    94.91%        95.18%
MNIST (5 vs 9)   97.21%    98.11%        98.47%
MNIST (6 vs 9)   99.54%    99.59%        99.80%

6 Conclusion and Future Work

In this paper, we have proposed a first-order algorithmic framework to solve a class of Wasserstein distance-based distributionally robust logistic regression (DRLR) problems. The core step of our framework is the efficient solution of the β-subproblem. Towards that end, we have developed a novel ADMM-type algorithm (the LP-ADMM) and established its sublinear rate of convergence. We have also conducted extensive experiments to verify the practicality of our framework. It is worth noting that problem (1.2) actually enjoys the Luo-Tseng error bound property when the transport cost is induced either by the ℓ1-norm or the ℓ∞-norm [16]. However, this does not immediately imply the linear convergence of our proposed LP-ADMM, as the method involves both primal and dual updates. Thus, it is interesting to see whether our proposed LP-ADMM or some other practically efficient first-order method can provably achieve a linear rate of convergence when applied to the β-subproblem.

References

[1] Jose Blanchet, Yang Kang, and Karthyek Murthy.
Robust Wasserstein Profile Inference and Applications to Machine Learning. Journal of Applied Probability, 56(3):830–857, 2019.

[2] Jose Blanchet, Karthyek Murthy, and Fan Zhang. Optimal Transport Based Distributionally Robust Optimization: Structural Properties and Iterative Schemes. arXiv preprint arXiv:1810.02403, 2018.

[3] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[4] Antonin Chambolle and Thomas Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

[5] Wanyou Cheng, Qunfeng Liu, and Donghui Li. An Accurate Active Set Conjugate Gradient Algorithm with Project Search for Bound Constrained Optimization. Optimization Letters, 8(2):763–776, 2014.

[6] Wei Deng and Wotao Yin. On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.

[7] Rui Gao, Xi Chen, and Anton J. Kleywegt. Wasserstein Distributional Robustness and Regularization in Statistical Learning. arXiv preprint arXiv:1712.06050, 2017.

[8] Xiang Gao and Shu-Zhong Zhang. First-Order Algorithms for Convex Optimization with Nonseparable Objective and Coupled Constraints. Journal of the Operations Research Society of China, 5(2):131–159, 2017.

[9] Mingyi Hong and Zhi-Quan Luo. On the Linear Convergence of the Alternating Direction Method of Multipliers. Mathematical Programming, 162(1-2):165–199, 2017.

[10] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A Dual Coordinate Descent Method for Large-Scale Linear SVM.
In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), pages 408–415, 2008.

[11] Changhyeok Lee and Sanjay Mehrotra. A Distributionally-Robust Approach for Finding Support Vector Machines. Manuscript, available at http://www.optimization-online.org/DB_HTML/2015/06/4965.html, 2015.

[12] Min Li, Defeng Sun, and Kim-Chuan Toh. A Majorized ADMM with Indefinite Proximal Terms for Linearly Constrained Convex Composite Optimization. SIAM Journal on Optimization, 26(2):922–950, 2016.

[13] Xudong Li, Defeng Sun, and Kim-Chuan Toh. A Highly Efficient Semismooth Newton Augmented Lagrangian Method for Solving Lasso Problems. SIAM Journal on Optimization, 28(1):433–458, 2018.

[14] Zhouchen Lin, Risheng Liu, and Zhixun Su. Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation. In Advances in Neural Information Processing Systems, pages 612–620, 2011.

[15] Fengqiao Luo and Sanjay Mehrotra. Decomposition Algorithm for Distributionally Robust Optimization Using Wasserstein Metric with an Application to a Class of Regression Models. European Journal of Operational Research, 278(1):20–35, 2019.

[16] Zhi-Quan Luo and Paul Tseng. Error Bounds and Convergence Analysis of Feasible Descent Methods: A General Approach. Annals of Operations Research, 46(1):157–178, 1993.

[17] Hongseok Namkoong and John C. Duchi. Stochastic Gradient Methods for Distributionally Robust Optimization with f-Divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.

[18] Hongseok Namkoong and John C. Duchi. Variance-based Regularization with Convex Objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.

[19] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally Robust Logistic Regression.
In Advances in Neural Information Processing Systems, pages 1576–1584, 2015.

[20] Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via Mass Transportation. Journal of Machine Learning Research, 20(103):1–68, 2019.

[21] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying Some Distributional Robustness with Principled Adversarial Training. arXiv preprint arXiv:1710.10571, 2017.

[22] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors. Optimization for Machine Learning. Neural Information Processing Series. MIT Press, Cambridge, Massachusetts, 2012.

[23] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Statistics for Engineering and Information Science. Springer Science+Business Media, LLC, second edition, 2000.

[24] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[25] Yangyang Xu. Accelerated First-Order Primal-Dual Proximal Methods for Linearly Constrained Composite Convex Programming. SIAM Journal on Optimization, 27(3):1459–1484, 2017.

[26] Junfeng Yang and Xiaoming Yuan. Linearized Augmented Lagrangian and Alternating Direction Methods for Nuclear Norm Minimization. Mathematics of Computation, 82(281):301–329, 2013.

[27] Wotao Yin. Analysis and Generalizations of the Linearized Bregman Method. SIAM Journal on Imaging Sciences, 3(4):856–877, 2010.

[28] Zirui Zhou and Anthony Man-Cho So. A Unified Approach to Error Bounds for Structured Convex Optimization Problems.
Mathematical Programming: Series A and B, 165(2):689–728, 2017.