Paper ID: 2710

Title: Exploration Bonus for Regret Minimization in Discrete and Continuous Average Reward MDPs

Originality: Learning weakly-communicating MDPs was already attempted by Bartlett and Tewari, who give a very similar result to the one in this paper (based on span regularization rather than an exploration bonus).

Quality: The results seem sound.

Clarity: The paper is well-written.

Significance: The idea of an exploration bonus has received a lot of attention in recent work. However, given the work of Bartlett and Tewari, the benefit of using exploration bonuses is mostly computational.

--- Edit --- Having read the authors' response, I agree that this is the first tractable algorithm for this setting. I have raised my score.

Originality: The work builds on the SCAL algorithm but contains some novel aspects. In particular, the technique used to prove optimism (i.e., that the gain used to compute a policy is an upper bound on the true optimal gain) is different from UCRL and SCAL. While the latter solve an expanded version of the MDP, this paper shows that the value-update operator T_c^+ (similar to the Bellman operator, but additionally projecting onto a span constraint) is optimistic with respect to its application to the optimal bias function. This yields a bound tighter by a \sqrt{\Gamma} factor, though the \sqrt{\Gamma} factor remains in the regret bound. The authors comment that it is an open question whether the \sqrt{\Gamma} dependency can be removed from the regret. This term is not present in the Politex algorithm (http://proceedings.mlr.press/v97/lazic19a/lazic19a.pdf) for tabular MDPs, though the dependence on T is suboptimal there. The analysis of SCCAL+ relies on similar assumptions and discretization as UCCRL [2], but has a practical implementation.

Quality: The technical content of the paper seems correct, though I did not read the supplementary material. The experiments are small-scale, and SCAL+ is not always better than SCAL or UCRL. However, it does perform best in a setting with large \Gamma, which is to be expected based on the theory.

Clarity: The paper is well-written, and the proof sketches and explanations make sense. It would be useful to explain the continuous-state contributions better, including placing them in the context of previous work and explaining why UCCRL cannot be implemented.

Significance: While there is no improvement in the regret results, the proposed algorithms seem practical, and the new technique for showing optimism may be useful in future work.

After reading the rebuttal, I am still in favor of acceptance, and my score remains the same.
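To make the operator discussed above concrete for readers, here is a minimal NumPy sketch of a span-constrained Bellman update in the spirit of T_c^+: apply an ordinary Bellman improvement step, then truncate the value vector so its span (max minus min) stays below a constant c. This is a simplified illustration under assumed conventions, not the paper's implementation; all function names are hypothetical.

```python
import numpy as np

def bellman_update(rewards, transitions, v):
    # rewards: (S, A); transitions: (S, A, S); v: (S,)
    # Q(s, a) = r(s, a) + sum_{s'} p(s' | s, a) * v(s')
    q = rewards + transitions @ v
    return q.max(axis=1)  # greedy improvement over actions

def span_projection(v, c):
    # Truncate so that span(v) = max(v) - min(v) <= c.
    return np.minimum(v, v.min() + c)

def t_c_plus(rewards, transitions, v, c):
    # Hypothetical sketch of a span-constrained value update.
    return span_projection(bellman_update(rewards, transitions, v), c)
```

For example, on a two-state MDP where each state is absorbing, starting from v = (0, 10) with c = 4, the update first produces (1, 11) and the projection clips it to (1, 5), keeping the span within the constraint.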

This paper extends the SCAL algorithms for regret minimization in MDPs from the discounted and finite-horizon cases to the average-cost case. The algorithm has the "optimism in the face of uncertainty" flavor, and the authors establish a \sqrt{\Gamma S A T} regret bound, matching the one for SCAL. While there is still a \sqrt{\Gamma} gap to the lower bound, the improvement is appreciated, as is the development of algorithms that do not rely on confidence regions. At a high level, the authors show how to add an optimism term to a simple empirical estimate of the MDP dynamics and control the regret. As a consequence, the algorithm is implementable if one has an approximate planning algorithm. Crucially, they need to assume an upper bound on the bias function. The algorithm is extended to continuous state spaces with a well-known discretization, but the proofs are quite technical. The authors claim the first implementable algorithm for this setting with a regret bound. Finally, fairly comprehensive experiments are provided (and expanded on in the appendix), showing that the proposed algorithm is sane, though I wish the authors would provide some quantification of the stability of the algorithm in the two continuous settings.

The presentation is OK. Most of the definitions are provided, and the notation is (perhaps understandably) very intricate. The rationale for the exploration bonus terms (3) is presented well in the paragraph starting at line 153, and I wish the authors would emphasize these points more. My biggest criticism is that the authors were sloppy with their definitions and left quite a few terms undefined. A few terms that I don't think the general NeurIPS community will be familiar with (and which should probably be defined) include: Cesaro limit, ScOpt (provide a citation), unichain, etc.

Overall, I think the technical contribution is strong. It advances an important and current problem (learning in MDPs) into the average-cost setting, which has generally been thought to be the most difficult. The algorithms studied, while extensively applied in the bandit literature, are not used as often in RL; it is good to see their application to a new domain.

Other things: 282: forgot to finish your sentence. I wish the experiment results were given more interpretation.
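For readers less familiar with the exploration-bonus construction the reviews refer to, here is a hedged sketch of the general idea: rather than maintaining a confidence region over models, add to the empirical reward a bonus that shrinks with the visit count, then plan in the resulting optimistic MDP. The constants, decay rate, and function names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def exploration_bonus(visit_counts, t, scale=1.0):
    # Larger bonus for rarely visited (s, a) pairs, decaying like
    # O(sqrt(log t / n)); the exact form here is a generic placeholder.
    n = np.maximum(visit_counts, 1)
    return scale * np.sqrt(np.log(max(t, 2)) / n)

def optimistic_rewards(empirical_rewards, visit_counts, t):
    # Plan with these inflated rewards to get an optimistic policy.
    return empirical_rewards + exploration_bonus(visit_counts, t)
```

The point of the construction is computational: planning in a single bonus-inflated MDP replaces optimizing over an extended MDP built from a confidence region, which is what makes the approach implementable given an approximate planner.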