Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
The paper studies a linear stochastic bandit problem in which the learner observes only a distribution over the context vectors, while the true context vector is a sample from this distribution. The authors propose a UCB-based algorithm and show that the uncertainty in the context vector leads to only a small additive regret. They also consider a variant of the problem in which the realized context is observed after the learner takes an action. A kernelized version of the problem is also studied, and finally experimental results are reported.

The setting of the current paper is very similar to that of the following paper: “Profile-Based Bandit with Unknown Profiles” by Sylvain Lamprier, Thibault Gisselbrecht, Patrick Gallinari; JMLR 19(53):1-40, 2018. That paper studies a less restrictive version of the problem in which only a sample from the context distribution is observed. I am concerned about the significance of the current paper in light of this earlier result. Please discuss the above paper in your related work section and explain the differences from and similarities to their setting.

Minor comment: In the paragraph after equation (1), the authors discuss two different notions of regret. They argue that one notion of regret is too difficult and leads to \Omega(T) regret. But in that example, the other notion of regret leads to 0 regret and the problem becomes trivial. A better example is needed here.
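To fix ideas about the setting under review, here is a minimal sketch of a LinUCB-style rule in which the learner sees only each arm's context distribution and scores arms by their expected feature vector, with the reward generated from a sampled (hidden) context. All names, dimensions, and distributional choices (Gaussian contexts, noise levels) are hypothetical illustrations, not the paper's actual algorithm or experiments.

```python
import numpy as np

def ucb_action(theta_hat, V_inv, expected_contexts, beta):
    """Pick the arm maximizing a UCB score on expected feature vectors.

    expected_contexts: (n_arms, d) array of E[x] under each arm's context
    distribution -- the learner never sees the realized context.
    """
    scores = []
    for x in expected_contexts:
        mean = x @ theta_hat
        width = beta * np.sqrt(x @ V_inv @ x)
        scores.append(mean + width)
    return int(np.argmax(scores))

def run_linucb(rng, theta_true, n_arms=5, d=3, T=500,
               lam=1.0, beta=1.0, noise=0.1):
    """Toy loop: UCB on expected contexts; reward from a hidden sample."""
    V = lam * np.eye(d)   # regularized design matrix
    b = np.zeros(d)       # running sum of feature * reward
    total_reward = 0.0
    for _ in range(T):
        # Hypothetical per-arm context distributions, summarized by means.
        means = rng.normal(size=(n_arms, d))
        theta_hat = np.linalg.solve(V, b)
        V_inv = np.linalg.inv(V)
        a = ucb_action(theta_hat, V_inv, means, beta)
        # The true context is a sample around the chosen arm's mean.
        x_true = means[a] + 0.1 * rng.normal(size=d)
        r = x_true @ theta_true + noise * rng.normal()
        # Update the estimator with the observed surrogate feature.
        V += np.outer(means[a], means[a])
        b += means[a] * r
        total_reward += r
    return total_reward
```

The point of the sketch is only that replacing the unobserved context by its expectation still permits a standard least-squares UCB update; the paper's analysis (and its kernelized variant) is considerably more general.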
- Originality: There are two main contributions. The first is to consider a contextual bandit (CB) setting in which the context is not known exactly, but only in a distributional sense. The second is that the authors give an effective algorithm for this setting, analyze its regret theoretically, and empirically examine its performance on three datasets.
- Significance: This paper seems to be a useful contribution to the CB literature and will motivate new algorithms in this setting. The authors also empirically examine how much data is needed to form a suitable context set.
- Quality: The theorem is clearly stated. However, I am somewhat concerned about the weak baseline. The paper does justify that comparing against the original baseline yields order-T regret when the context realization is unknown, but I wonder whether a stronger regret analysis (against the original, stronger baseline) is possible in some specific cases (say, with some restrictions on the context distribution and reward structure). For the experiments, in the MovieLens case, it is interesting that hidden (with sample 100) outperforms hidden (expected); can the authors offer a justification? Also, for the crop data case, it would be better to include some justification for the very small advantage of using the exact context.
- Clarity: The paper is well written; it would be great to have more explanation of the experimental results.
This paper is interesting and well written in general. To the best of my knowledge, this is the first paper that studies contextual (stochastic) bandits with context distributions, which is an interesting problem. After a high-level check, I believe the analysis is technically correct. The experimental results are also interesting. My only major concern is that this paper might be relatively straightforward technically, given the existing literature. In particular, the proof of Theorem 1 (the main theoretical result) seems to follow quite directly from identifying certain martingale difference sequences. Please clarify!

-----------------------------------------------------------------------------------

I have read the rebuttal and prefer to keep my score.
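For context on the "martingale difference sequences" remark: the standard concentration tool for such arguments is the Azuma–Hoeffding inequality (stated here as general background, not necessarily the exact form used in the paper's proof). If $(Z_t)_{t=1}^T$ is a martingale difference sequence with $|Z_t| \le c_t$ almost surely, then

```latex
\Pr\!\left(\left|\sum_{t=1}^{T} Z_t\right| \ge \epsilon\right)
\;\le\; 2\exp\!\left(-\frac{\epsilon^2}{2\sum_{t=1}^{T} c_t^2}\right).
```

Bounds of this shape yield the $\tilde{O}(\sqrt{T})$-type additive terms typical of such regret analyses, which is why identifying the right difference sequences can make the proof short.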