Paper ID: | 2178 |
---|---|

Title: | A First-Order Algorithmic Framework for Wasserstein Distributionally Robust Logistic Regression |

This paper derives a novel algorithm for solving the dual DRLR problem when \kappa < \infty (i.e. the labels may change during transport). The algorithm performs a golden section search for \lambda, within which the sub-problem for optimal \beta, fixing \lambda, is solved by an ADMM algorithm. The ADMM algorithm differs from typical ADMM approaches in two ways: (1) the \beta-update is ill-conditioned, requiring a careful choice of iterative method, while (2) the auxiliary \mu update is locally strongly convex, enabling the use of a first-order (not quadratic) approximation with a fixed step size. I see three theoretical contributions: 1. An upper bound on optimal \lambda, stated in Proposition 1, which enables the golden section search. 2. The LP-ADMM iterative method for solving the \beta subproblem, stated as Algorithm 1. 3. O(1/k) convergence of LP-ADMM, stated in Theorem 4.2. And two empirical results: 1. Substantial speedups (wall clock) over standard YALMIP solver on the full DRLR problem, for both adaptive and nonadaptive step size selection strategies for the inner \beta subproblem, using both synthetic and real data. 2. Fast convergence of LP-ADMM on the \beta subproblem, as compared to several standard iterative methods, using synthetic data. Strengths: 1. Fast solution of DRLR is potentially useful and (to my knowledge) previously unaddressed. 2. The paper is fairly easy to follow, with the exception of the notation change between sections. 3. I like Figure 1, which shows the relationship of LP-ADMM to the DRLR problem. 4. The empirical speedups appear to be substantial. Weaknesses: 1. The change in notation from the DRLR sections (1, 3, and 5) to the generic LP-ADMM sections (2 and 4) is potentially confusing. Is there a way to make the notation consistent? (Or at least disjoint?) 2. Section 5.3 bears little relation to the rest of the paper and seems to show a result nearly identical to Table 1 in [1] (using different datasets). Should it be omitted entirely? 3. Section 5.2 uses only a single synthetic problem to argue for faster convergence of LP-ADMM as compared to standard methods. Could you show something like number of iterations to convergence under various tolerances, using multiple real datasets? Or is there another way to show that the standard methods are problematic, given real data? This claim needs more evidence. 4. There's no analysis of the relative contributions of the golden section search versus the LP-ADMM. I.e. if I plug in a standard solver for the \beta subproblem (in place of LP-ADMM), does golden section search still yield a substantial speedup? And is there a reason to believe golden section search will outperform other univariate search methods? Less major weaknesses: 1. There's no real discussion of the choice between solvers for the \beta update within the ADMM. Which should I choose and when? Could you expand the discussion on lines 129-132? At least: What are the tradeoffs between the methods? (Presumably efficiency, under different assumptions.) Could you illustrate the difference empirically? Are there practical rules involving, e.g., condition numbers for the data matrices? 2. In Section 5.1, there's no mention of the chosen iterative method for the \beta update in the ADMM. 3. In Section 5.1, there's no comparison of adaptive to nonadaptive penalty using real data (Figure 3). Also, it's unclear which of these you're using in Figure 3. Typos, etc.: * The horizontal axis in Figure 2 isn't labeled. * There's a typo in Remark 6.9 (in the appendix): "... it statisfies ..." [1] Shafieezadeh-Abadeh, Esfahani, Kuhn. Distributionally Robust Logistic Regression. NIPS 2015. ---- UPDATE: I have read the authors' response and I appreciate their explanations and willingness to include additional content. I will not be changing my score.

Originality: The WDR-LR problem is timely and important as it can offer advantages over other logistic regression formulations, as demonstrated by previous works. The proposed framework for wrapping an ADMM method in a line search method to address this problem appears to an appropriate and novel approach. It is also a nice motivation for the development of the new LP-ADMM method variant. The main novelty of the LP-ADMM method is that the y variables uses a first-order instead of second-order approximation, which is justified by the strong convexity of that subproblem. While the change may appear somewhat minor, this leads to a more aggressive/robust strategy with demonstrated benefits based on the numerical experiments. The authors do a good job for the most part on the literature review, although they may want to discuss some more works on applying first order methods to different types of DRO problems than those studied herein, see [1] below for example. Clarity: The paper is for the most part written well and is well organized. Quality: To the best of my knowledge, the mathematical analysis and proofs are correct. Although the O(1/K) convergence rate does not improve on other types of related ADMM methods, the numerical experiments justify its superior performance for the WDR-LR problem. In particular, the authors demonstrate improved performance over both off he shelf YALMIP solver as well as other existing first order/ADMM methods for solving the beta subproblem. Significance: The WDR-LR problem is timely and important as it can offer advantages over other logistic regression formulation, as demonstrated by previous works. Still, since the problem is somewhat narrow and the formulation is more complicated than ridge logistic regression for example, it may be difficult for this work to have a very strong amount of practical impact and/or follow up works. Nevertheless, this paper appears to be an important first step in addressing Wasserstein DRO type problems with large-scale optimization algorithms. [1] Namkoong, Hongseok, and John C. Duchi. "Stochastic gradient methods for distributionally robust optimization with f-divergences." Advances in Neural Information Processing Systems. 2016.

1. It would benefit the reader if the result is stated for the general norm for beta, instead of merely mentioning "the framework is general enough to extend to other norms". 2. What is the relationship between the KKT point of the problem (2.1) and the global solution of the original problem? In particular, if we get an eps-optimal solution of (2.1), what is the optimality gap of the original DRLR problem? 3. DRO typically works better the empirical risk minimization when the sample size is relatively small. Several numerical experiments use a large dataset such as MNIST. Can you run the experiment with a much smaller subset of the data and make corresponding comparisons? 4. What is the information-theoretic lower bound of the DRLR problem? This is related to the last sentence in the conclusion section. 5. A minor comment: can you state the problem in terms of the original variables beta, lambda, and s, as it facilitates the reader?