NeurIPS 2020

Online Learning in Contextual Bandits using Gated Linear Networks

Review 1

Summary and Contributions: This paper proposes a novel method for contextual bandits based on Gated Linear Networks and outperformed existing work, especially for discrete tasks.

Strengths: The proposed algorithm, Gated Linear Contextual Bandits (GLCB), is based on Gated Linear Networks (GLN). GLN could be efficiently trained because the loss is propagated directly to all units, and the loss is convex. Because of these properties of GLN, the new method GLCB is suitable for online learning. The experimental results show the effectiveness of GLCB.

Weaknesses: The execution time for inference is not provided in the paper. It might be a problem in cases like web marketing, where the agents must make decisions in millisecond order. The advantage of the proposed algorithm is clear for the discrete tasks but not for continuous tasks.

Correctness: I did not find significant problems in the explanation of the methods. I have small concerns about the experiments. Please see the feedback.

Clarity: This paper is very well written. It explains a complex algorithm well in the limited number of pages. I had to read the paper [16] to understand the Gated Linear Networks, but I think that is not a major problem.

Relation to Prior Work: To my knowledge, this paper covers important related work. This paper clearly explains how the new algorithm differs from the existing work.

Reproducibility: Yes

Additional Feedback: I would like to know how the authors selected the datasets for the experiments. I think the authors used seven datasets out of the eight datasets described in the paper [7]. The only one not used from the paper is "mushroom." Why is it excluded? Also, I may have overlooked, but the synthesized dataset "wheel" seems to have a parameter $\delta$. What is the value of $\delta$? Was it explained somewhere? (comment after rebuttal) I think the rebuttal adequately addresses my concerns.

Review 2

Summary and Contributions: The paper introduces a new online contextual bandit algorithm bases on Gated Linear Networks, a type of deep learning architecture that was motivated by the need of online supervised learning. The new algorithm is called Gated Linear Contextual Bandits (GLCB) and is endowed on a UCB-like mechanism to estimate prediction uncertainty which takes advantage in a very interesting way of how Gated Linear Networks partition the input space. This allows GLCB to compute prediction uncertainty (which is used to promote exploration similarly to UCB) with minimal computation overhead. The algorithms is validated in a battery of empirical benchmarks which demonstrate its superior performance against a 9 state-of-the art alternative algorithms.

Strengths: - Novelty: The proposed algorithm is novel in the sense of being online and still highly competitive for contextual bandit problems, even against non-online state-of-the-art alternatives - Theoretical soundness: The authors show convergence and regret guarantees of their algorithm which show for instance that the policy error tends to zero asymptotically.

Weaknesses: The paper does not provide any explanation of how the algorithm deals with continuous rewards. In the main paper, the authors only mention the CTREE algoritm, but do not explain how or even that it is combined with GLCB, and one has to go into the appendix to figure out what they mean. Still, even in the appendix there is no explicit literature reference for the CTREE algorithm.

Correctness: As far as I can tell, the paper seems correct, and the empirical evaluation solid.

Clarity: The paper could be written in a more self-contained way by explicitly reporting important pieces of information related to the implementation of the algorithm that otherwise has to be looked up in the literature. For instance, the authors do not provide a explicit expression for the gating function, which has to be found in the Gated Linear Networks papers.

Relation to Prior Work: The authors provide a clear discussion of the previous literature.

Reproducibility: Yes

Additional Feedback: POST-REBUTTAL COMMENTS: I want to thank the authors for their rebuttal letter and promising to expand on their CTREE discretization method in order to make the paper more self-contained. I also commend them for running some of the experiments recommended by the other reviewers that will nicely complement the paper. I am satisfied with the rebuttals.

Review 3

Summary and Contributions: This paper proposes a fully online learning method for contextual bandits using Gated Linear Networks (GLN). The key contributions are two parts: (1) introducing GLN to the contextual bandit setting; (2) devise a novel exploration scheme for GLN based contextual bandit.

Strengths: The methodology to do online contextual bandit is novel. Moreover, the paper is well-written and easy to follow.

Weaknesses: The rationale for why utilizing such a scheme for "Pseudocounts for GLNs" is not clearly explained. Moreover, the experimental justification seems to be not enough.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: The contribution of this paper is significant. If the experimental results are completely justified, I believe this method will become a classical method in contextual bandits. But I have three major concerns. (1) The rationale for utilizing the proposed "Pseudocounts for GLNs". Why do you use such an aggregation scheme? i.e. what's the rationale? Are there any other alternatives? (2) The experiments look very promising. But after a careful checking, I find one problem: the data size is very small as you only report experimental results till 5000. I wonder if the advantage of your proposed method can keep when the data size increases? Furthermore, what's the time cost of your experiments? I think this should be included as a comparison with other methods. I have some doubt on the time efficiency of the proposed method, especially when there are many arms. (3) Algorithm 3 is not very adequately described. I think you should include a complete algorithm for bounded, continuous rewards, rather than only report the CTREE part. I am willing to adjust my score if you can well address my concerns. (update after rebuttal) The feedback partially answers my concerns and I've raised my score to 7. But in general, it is not easy to reproduce your paper given your provided material (including your code). Hope you can present more details in the revised version.

Review 4

Summary and Contributions: Paper proposes a fast uncertainty estimation heuristic for a deep learning architecture yielding a UCB-style contextual bandit algorithm managing exploration with expressive power. ----------------------- Having read the other reviews and the author response, I currently see no need to adjust my rating. I believe this paper would benefit from additional refinement and another peer review cycle. The core idea is good but the experiments seem rushed.

Strengths: The core idea is interesting: uncertainty heuristic might be viable on particular deep architectures.

Weaknesses: The paper struggles to explain gated linear networks (GLNs) to the unfamiliar. The experimental results are presented without analysis.

Correctness: The proposed procedure appears very strong empirically, but the experimental section has issues. A good experimental section not only presents results, but 1) provides a rationale for each experiment in terms of major claims of the paper and 2) attempts to explain patterns in the results. #1 is implicit, but seems to be "demonstrating generally superior empirical performance". Unfortunately it is not clear if the grid search over hyperparameters for GLCB is leaking information about the test set. More detail is required in Appendix B. Furthermore, there is no idea of the sensitivity of GLCB to hyperparameter choices (i.e., what if tuning is not done?). #2 (analyzing results): it is apparent that GLCB compares more favorably on classification tasks than regression tasks, but there is no comment or follow-up experimentation. Question: does GLCB do better on regression tasks if the (scaled and shifted) reward is randomly rounded to a Bernoulli to induce a classification problem?

Clarity: No. (See additional feedback for suggestions).

Relation to Prior Work: Yes.

Reproducibility: No

Additional Feedback: Equation (1) claims to be a loss function but there is no label. The exposition early in section 2 should explicitly state GLNs are a density estimation technique that is applied to classification. For those unfamiliar with GLNs, it is unclear where line 15 of Algorithm 1. Neither the submission nor the supplemental discusses this further.