Review for NeurIPS paper: Large-Scale Methods for Distributionally Robust Optimization

NeurIPS 2020

Large-Scale Methods for Distributionally Robust Optimization

Review 1

Summary and Contributions: The paper studies the use of batch stochastic gradient methods to solve large scale DRO problems. In these scenarios, we face two problems: (1) Stochastic gradient estimates of DRO problems are biased; (2) Due to the size of large-scale problems, the convergence rate of the methods used to tackle them should not depend on either the number of parameters d or number of training examples N. The authors tackled problem (1) by defining a surrogate objective for which the gradient estimates are unbiased. Then, by carefully bounding the difference between the true and surrogate objectives as a function of the batch size n, the authors are able to give optimality bounds for the true cost by optimizing the surrogate cost, using a large enough batch size. Moreover, for some classes of robust risks, the authors also bound the variance of the gradient estimates. This allows them to use an accelerated version of the stochastic gradient method which achieves tighter convergence bounds. Regarding problem (2), all the bounds given do not depend on either d or N, which makes the results suitable for large-scale applications. The authors also give lower bounds for the classes of DRO problems studied, proving the (sub)optimality of the proposed methods. Finally, these methods are used in experimental setups, which apparently corroborate with the theoretical results presented. %%% post rebuttal comments: %%% I appreciate the authors' effort to prepare the response letter. The authors satisfactorily addressed my comments concerning the bias in gradient estimates and numerical comparison with the existing approaches. I am happy with the response and increase my score to 7. %%%%%%%%%%%%%%%%%%%

Strengths: (1) The paper tackles an interesting problem. DRO, as well as methods for large-scale optimization (e.g. stochastic gradient methods), have increasingly gained the attention of the ML research community for the past few years. Developing methods to solve large-scale DRO problems efficiently, with theoretical guarantees, could be applicable and impact many different areas. (2) For (most of) the DRO problems studied in the paper, the authors present both upper and lower bounds for the convergence rate of the proposed approaches. (3) The experiments presented seem to corroborate with the theoretical results of the paper. The authors also provide the code used in the experiments. Moreover, their approach is shown to be easily integrated with PyTorch, which improves the applicability of these methods in exiting optimization pipelines.

Weaknesses: In my opinion, the paper does not present a major weakness. See below for minor comments on the paper.

Correctness: (1) The proofs in the supplementary material were not checked in detail, thus I cannot certify the soundness of the theoretical claims. (2) The authors seemed transparent with respect to their empirical methodology, that is, how the datasets were preprocessed and how hyperparameters were tuned for the experiments.

Clarity: In my opinion, the paper is mostly clear and well written, modulo some minor comments (see below).

Relation to Prior Work: I think the authors did a good job reporting the related literature, for example, when presenting a comparison with known results in Table 1.

Reproducibility: Yes

Additional Feedback: General comments and typos: (1) In Line 37, it is stated that it is straightforward to obtain unbiased gradient estimates for the surrogate risk. In contrast, in Line 146 it is stated that the gradient of an empirical estimate of L on a mini-batch of size n is clearly a biased estimate. It would be nice to explain a little better why these two facts are so clear/straightforward, since they are central to the discussion presented in the paper. (2) Line 61: it is not clear what is meant by “error floor”. (3) Line 75: at this point in the paper, it is not clear what is meant by “s \in S”. (4) Line 90: shouldn’t it be “\alpha^{-1}\epsilon^{-2}” (5) Line 152: the hyperlinked reference to “Nesterov” looks strange. I think it would be better to use a regular reference, that is, [46]. (6) As explained in the paragraph starting in Line 231, some of the results presented in Table 1 follow when \nu, \lambda and \alpha respect some inequalities. Can these conditions be assumed without loss of generality? If not, I think it needs to be mentioned that the results in Table 1 follow given some extra conditions. (7) One way to improve the empirical evaluation of the proposed methods would be to compare their performance with other methods from the literature (for example, the ones presented in Table 1). Even though those methods have worse theoretical guarantees, it would be interesting to see how they compare in practice. (8) In the last two equations of Line 910, isn’t “>=” supposed to be “<=” in both equations? (9) Line 1221/1222: what is the justification behind using \Delta L_{CVaR} rather than \Delta L_{kl-CVaR}, even though it goes against the theory?

Review 2

Summary and Contributions: ## UPDATE ## I'm grateful to the authors for kindly answering my question. I've read the response and will maintain my score. A minor suggestion: It would be helpful for practitioners if authors could present pseudocode of the algorithm in the appendix. ############ The paper proposes and analyzes distributionally robust optimization (DRO) algorithms. The analysis shows that the number of gradient evaluations the algorithms require is independent of the training set size and number of parameters, which makes the algorithms scalable to large instances. The analysis is based on the bound of bias caused by the use of empirical robust objective functions. A gradient estimation method with multi-level Monte Carlo (MLMC) is also analyzed, which yields optimal convergence results. The theoretical results are validated with experiments.

Strengths: - The paper makes many significant theoretical contributions to DRO, an important subject that is relevant to the NeurIPS community. - The experiments strongly support the theoretical results. - The advantages over previous work are clearly described.

Weaknesses: - In experiments, comparisons with other existing algorithms (e.g., [Curi et al. (2019); Namkoong and Duch (2016)]) are not presented. - For readers who are unfamiliar with this are (like me), it is hard to intuitively understand the key reason why such bounds, which are independent of the training set size and the number of parameters, are possible.

Correctness: It was hard to check the correctness, but no errors were found as far as I can see.

Clarity: The paper is well written and the results are clearly described.

Relation to Prior Work: Relation to previous work is clearly described.

Reproducibility: Yes

Additional Feedback: I would appreciate it if the authors could briefly describe the key point that makes it possible to obtain the bounds shown in Table 1 and why existing studies failed to do so.

Review 3

Summary and Contributions: The paper considers distributionally robust optimization problems with CVaR and \chi^2 divergence ambiguity sets. The authors first show that a biased gradient estimate, based on a mini batch approach, can be used to find an epsilon optimal solution with lower number of gradient evaluations compared to naive gradient based methods. The sample complexity then improves by using a better gradient estimate based on a Monte Carlo approach. =================================================== [Update after reading rebuttal]: I would like to thank the authors for answering my concerns. As a suggestion for the final version, it will be very interesting to include a plot comparing the execution time of the vanilla SGD with your proposed method. In practice, SGD variants with adaptive sampling scheme cannot outperform vanilla SGD with uniform sampling in terms of execution time. This is mainly due to time complexity of the sampling oracle. I was wondering if the authors can add a short remark or discussion about this concern. All in all, my evaluation remains the same.

Strengths: I think the paper certainly is a strong submission. In particular, the use of stochastic gradient descent with a more sophisticated gradient oracle in the context of distributionally robust optimization problems (with \chi^2 divergence and CVaR ambiguity sets) is new and very interesting. I would also replace the accelerated SGD algorithm with the simple averaged SGD algorithm since the averaged SGD converges to the minimizer of the problem with the rate O(1/T) + O(1/sqrt(T)); see for example Theorem 6.3 of [Bubeck 2015]. Note that using the accelerated SGD algorithm only improves the rate to O(1/T^2) + O(1/sqrt(T)). ========================================================= Bubeck, Sébastien. "Convex Optimization: Algorithms and Complexity." Foundations and Trends® in Machine Learning 8.3-4 (2015): 231-357.

Weaknesses: I found the paper a bit hard to read because the authors analyzed several different cases at the same theorem. However, I want to thank them to really push all theoretical results and consider the most general case. I also found one of the assumptions in the paper restrictive. Having bounded loss function simply implies that the feasible set X has to be compact. This is certainly fine because both average and accelerated SGD algorithms heavily relies on the project step on set X. My suggestion is to replace bounded loss assumption with bounded feasible set because together with Lipschitz continuity you will have a bounded loss function.

Correctness: I checked the proof, and to the best of my knowledge, all results are correct.

Clarity: The paper is very well-written.

Relation to Prior Work: The connection to previous studies are clearly made. I particularly really like Table 1 in the paper.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: ##### Update ##### I have read the rebuttal and will maintain my score. ################## The authors offer an analysis of applying SGD to the primal problem of distributional robust optimization. While the mini-batch gradient estimator is an biased estimate for the true gradient, it is an unbiased estimate for a surrogate function which is proven to be within epsilon distance from the original objective as a function of the number of samples in the minibatch. The authors show convergence results for difference uncertainty sets of interest and conclude with experiments exhibiting convergence of standard SGM applied to DRO.

Strengths: While I have not confirmed correctness of the proofs (I assume a much longer paper will come out of this based on the long supplement), the main strength of the work is in the bounds on convergence rates that are claimed as they are independent of the number of points in the training set and the number of parameters. The authors note other works that use minibatch gradient estimates but do not offer any such analysis. DRO is of great relevance to the NeurIPS community so faster algorithms for such problems are important.

Weaknesses: The only weakness is in the experiments. Convergence is shown for the proposed method, but not compared against previous methods such as Dual SGM or Stochastic primal-dual. This could include various methods noted by the authors that do not have a convergence analysis. Furthermore, the reader would greatly benefit from seeing the benefit of DRO that exemplify how it can achieve more robust models on test data.

Correctness: I have not verified correctness.

Clarity: The paper is well-written. I recommend some changes to consider below.

Relation to Prior Work: It is clearly discussed. Various other works that use mini-batch gradient estimators are discussed, but it is claimed they do not have the convergence bounds stated here.

Reproducibility: Yes

Additional Feedback: In my opinion, the proof sketches do not add much to the reader. The space could be better used with other discussions. For example, define the gradients as done in the supplement in A.1.5 and explain explicitly why the gradient is biased but unbiased for the surrogate (rather than saying "Clearly").