__ Summary and Contributions__: This paper proposes an adaptive sampling algorithm for optimizing the CVaR of a loss function. The sampler runs EXP3 on top of a k-DPP to adversarially maximize the loss.

__ Strengths__: The framework can be practical in robust/fair learning.

__ Weaknesses__: The algorithm is not as elegant as one would hope: well-known existing methods are stacked to obtain the expected results.
Sampling only one data point per iteration seems impractical. Is it possible to use a minibatch?

__ Correctness__: The mathematical claims of the paper seem correct.

__ Clarity__: Yes.

__ Relation to Prior Work__: This has a flavor similar to boosting/GANs, in that the learner plays a game against the sampler, which forces the learner to be more robust.
Is there any connection to AdaBoost? How about GANs? Combinatorial multi-armed bandit literature?

__ Reproducibility__: Yes

__ Additional Feedback__: Did you consider combinatorial bandit algorithms instead of k-DPP + EXP3?
How many repeat runs did you do? Figure 3 has really large error bars. Looking at the loss distributions of AdaCVaR vs Mean would be a more interesting visualization than the test score bars, both for Figure 2 and Figure 3.
Figure 2 "Classification" results are hard to interpret. Does that mean AdaCVaR prefers uniform-like predictions on easy data?

__ Summary and Contributions__: The paper proposes an adaptive sampling algorithm to optimize the conditional value-at-risk, a risk measure that focuses on the elements with the highest loss. The approach proposes an alternative sampling method that bypasses the high-variance problem of the standard batch-based estimators of the conditional value-at-risk. The theoretical guarantees are standard generalization bounds and average-regret bounds. The approach is supported by numerical experiments.
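For concreteness, the batch-based estimator this summary refers to is the average of the worst α-fraction of losses in a batch. A minimal sketch (the function name `empirical_cvar` and its interface are chosen here for illustration, not taken from the paper):

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical CVaR_alpha: the mean of the worst alpha-fraction of losses.

    With small batches this uses only ~alpha*N points, which is the source
    of the high variance the adaptive sampler is meant to bypass.
    """
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(alpha * losses.size)))            # tail size
    return losses[:k].mean()

# e.g. empirical_cvar([1, 2, 3, 4], alpha=0.5) averages the two worst losses
```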

__ Strengths__: The paper addresses an important problem from the perspective of a minimax game between a learner and a sampler. It is well-written and accessible. I believe the claim is new, and the empirical performance evaluation compares against other work on CVaR optimization.
----- After rebuttal -----
I read the other reviews; the authors' response slightly modified my score.

__ Weaknesses__: Since the paper features both ideas from statistical learning and bandit algorithms, its understanding requires theoretical background in both fields.

__ Correctness__: The claims and method, as well as the empirical methodology, are correct.

__ Clarity__: The paper is well-written, but there are some complicated or undefined definitions.
Precisely:
- l.95 \hat{C} is not defined before it is presented.
- l.120 could you give a reference for the 'partial bandit feedback model' ?
- Lots of notation abuse, e.g. \mathbb{E}_q
- In Algorithm 1, \hat{L}_t is probably the estimator of L_t in R^N, but it is assigned a real value.
- In Algorithm 1, what is \sim_{u.a.r.} ?
- l. 160 defines W_{I, t} while l. 167 defines W_{t, I} which is quite confusing.
Typos:
- l 264: 'Hence it finds'
- avoid inline fractions (\frac{}{})

__ Relation to Prior Work__: The relation to prior work is well-explained.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: ## Update ##
Thank you very much for kindly responding to my comments. In my opinion, the proposed method itself is interesting and well analyzed. On the other hand, the method is not sufficiently motivated in light of the DRO context: how is the proposed approach advantageous over existing methods that use other DRO sets? Is the theoretical result stronger than those of existing methods? All in all, I feel the paper is marginally above the threshold, and I will maintain my score.
###########
This paper proposes an adaptive sampling method that enables stochastic optimization of the CVaR of loss values, which involves two challenges: the rejection of many sampled data points and exploding gradients. The underlying idea for avoiding them is to write the problem in min-max form and to use no-regret algorithms. A naive implementation of the sampler's online algorithm, which samples data points, incurs an exponential computational cost. The authors avoid this by sampling from a k-DPP. The proposed method is validated with experiments.
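As a rough illustration of the sampler side of this min-max game (this is a generic EXP3-style multiplicative-weights update, not the authors' exact algorithm; the name `exp3_update` and the toy numbers are assumptions):

```python
import numpy as np

def exp3_update(weights, arm, loss, prob, eta):
    """One EXP3 step: importance-weighted reward estimate + multiplicative update.

    The sampler treats the learner's loss on the sampled point as its reward,
    so high-loss data points get exponentially boosted sampling weights.
    """
    est = np.zeros_like(weights)
    est[arm] = loss / prob                # unbiased estimate of this arm's reward
    return weights * np.exp(eta * est)    # multiplicative-weights update

# toy usage: three data points, uniform start; point 2 incurs a high loss
w = np.ones(3)
p = w / w.sum()
w = exp3_update(w, arm=2, loss=1.0, prob=p[2], eta=0.1)
```

Running this naively over subsets of size k (rather than single arms) is what incurs the exponential cost mentioned above, which the k-DPP sampling is meant to sidestep.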

__ Strengths__: - The resulting CVaR value is theoretically bounded in expectation.
- The empirical effectiveness is confirmed with extensive experiments.
- Risk-averse learning is an important topic relevant to the NeurIPS community.

__ Weaknesses__: - In the context of robust optimization, the idea of combining min-max formulations with no-regret algorithms is prevalent. In this sense, the proposed method appears to be just a CVaR variant of this line of work, and is thus somewhat weak in terms of technical novelty and significance.
- As discussed in Section 2, Namkoong & Duchi (2016) proposed a stochastic DRO method for DRO sets based on Cressie-Read f-divergences, and it is stated that the DRO set considered in this work is different. However, it is unclear why employing this different DRO set is advantageous.

__ Correctness__: The claims and methods seem to be correct, although I did not check the proofs in the appendix.

__ Clarity__: The paper is mostly well-written and the statement is clear. The second paragraph in Section 1 is a little misleading: the authors raise "non-convexity" and "high variance" as current issues, but as far as I can see, no theoretical results that resolve these issues are presented in the paper.

__ Relation to Prior Work__: The relation to prior work is well described, but the authors should elaborate more on differences from other DRO methods (e.g., Namkoong & Duchi (2016)). Comparisons with other combinatorial bandit algorithms (e.g., [Combes et al., Combinatorial Bandits Revisited, NIPS2015]) should also be presented to explain why the k-DPP-based method is advantageous.

__ Reproducibility__: Yes

__ Additional Feedback__: Is it possible to obtain any theoretical guarantee that bounds the variance of CVaR (and its gradient)?