NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:1430
Title:Information-Theoretic Confidence Bounds for Reinforcement Learning

Reviewer 1

- Originality: Although related ideas have been proposed in [1], here they are extended to more general results involving confidence bounds over rewards/observations, and are applied to more general multi-step RL settings. - Quality: The analysis is quite thorough and seems correct, although I have not checked the proofs in detail. - Clarity: The paper is well-written for a theory audience, but fairly dense and requires careful reading, which I think will limit its reach to the wider community. It would be great to make it more accessible to a more general audience, as the ideas it contains are fairly intuitive at their core. One suggestion would be to include illustrative figures to convey the general intuition, for example for the case of the linear-Gaussian bandit, since the confidence sets have a natural geometric interpretation in terms of the variance of the posterior. - Significance: I think these results have the potential to be significant, because they give general tools for computing regret bounds for the RL algorithms applied in different scenarios. The analyses of the examples given (linear bandits, MDPs, factored MDPs) essentially all follow a recipe made possible by the results relating the confidence bounds to the regret. Specifically, they are: 1) Construct a confidence interval based on the mutual information using the characteristics of the problem at hand (linearity/Gaussian noise assumptions for the bandit, specific forms of the prior for MDPs) 2) Bound the sum of the information gain. Combining these two then gives a regret bound. It is possible that this same approach could be applied to a broad class of problems other than the ones considered here. ### Minor Comments: - Line 26: please add references for "Most analyses..." - Line 69: should "a" be "A_\ell" here? This notation is used beyond the first step later in the paper, but here is says "Y_{1, a}". - Section 4 and elsewhere: It seems the \theta variable is overloaded at different places. For example, Line 111 \theta refers to the true model but elsewhere it refers to the random variable. Please use \theta^\star or something similar when referring to the true model.

Reviewer 2

## Summary: This paper builds upon the information-theoretic analysis of Thompson sampling in Russo and Van Roy (JMLR'16). The authors develop a per-stage bound on the expected regret in terms of the mutual information between a parameter describing an unknown environment model and the current action and observation outcome. The per-stage regret bound can, in turn, be used to obtain a bound on the total regret. The authors, also, show that the per-stage regret bound is related to a bound on the deviation of the reward function from its mean. The proposed bounds are computed explicitly in the cases of linear bandits, tabular MDPs, and factored MDPs. ## Comments: The paper offers novel regret bounds for Thompson sampling and UCB algorithms but the key idea is mostly a straightforward extension of Russo and Van Roy (JMLR'16). The detailed derivations for all three cases (linear bandits, tabular MDPs and factored MDPs) and the explicit computation of the Gamma term for common probability distributions are particular strengths of the paper. A weakness of the paper is that the theoretical results use standard properties of mutual information and Gaussian/Dirichlet/Bernoulli distributions and hence lead to very general and very loose bounds. It is interesting to see the order of the regret bounds for the MDP setting but the bounds seem quite loose and not very useful for developing new algorithms. There are some problems with the notation in the paper. The use of distribution, density, and measure is incorrect. In the first definition of KL divergence, P is used as a probability measure. In the second, definition of mutual information P is used with a strange notation which comes from ref. [10] but is not introduced in this paper. Without a proper definition, the expression P(X,Y) is meaningless since the input for a measure function is a set. New notation \mathbb{P} is introduced in Sec. 3.2; \ldots is missing in the history definition. There is a switch from \ell to t to denote the time steps in the middle of the paper, yet H_\ell is still used as to denote the observation history. ## Update after rebuttal The authors acknowledge that the notation should be improved and offer to add simulated examples of the information gain and regret bounds. This would certainly strengthen the paper and address my main concerns.

Reviewer 3

This paper integrates information-theoretic concepts into the design and analysis of optimistic algorithms and Thompson sampling. By making a connection between information-theoretic quantities and confidence bounds to obtain results that relate the per-period performance of the agent with its information gain about the environment, thus explicitly characterizing the exploration-exploitation tradeoff. The resulting cumulative regret bound depends on the agent’s uncertainty over the environment and quantifies the value of prior information. This paper show applicability of this approach to several environments, including linear bandits, tabular MDPs, and factored MDPs. These examples demonstrate the potential of a general information-theoretic approach for the design and analysis of reinforcement learning algorithms.