NeurIPS 2020

Stage-wise Conservative Linear Bandits

Meta Review

The paper addresses the problem of balancing exploration and exploitation under a safety or minimum-utility requirement, a problem that has emerged recently within the online learning community. The reviewers all agree that the paper shows significant novelty and does a comprehensive job in terms of algorithm design and regret analysis. The reviewers' concerns included the possibly restrictive nature of the assumptions made on the decision space and on knowledge of problem-dependent structural constants, the need for more explicit experiments to quantify the amount of exploration, and other technical clarifications; most of these were addressed satisfactorily by the author response. A broader, lingering concern that I have about the paper (and, in general, with this line of work) is that, beyond the 'usual' application settings of recommendation systems and clinical trials, no attempt is made to connect the technical formulation to a concrete and relatable problem that shows how the constraints arise organically. For instance, in clinical trials, the decision space (the set of treatments) is likely finite, and low regret does not seem to be a logical objective compared to fast inference, i.e., identification of the best safe arm after an experimentation horizon. I would urge the author(s) to reflect upon this to situate the work better.