NeurIPS 2019
Sun Dec 8th – Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID: 1600 Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function

### Reviewer 1

The paper focuses on the important problem of designing optimal algorithms for exploration-exploitation (whose upper bound matches the lower bound). The paper is not well organized and written: it is difficult to abstract from the mathematical formulation and grasp the key ideas behind the improvement of the regret bound.

As far as I understood, the first important component in improving the bound is the use of variance-dependent confidence intervals (i.e., Bernstein). Together with the knowledge of H, this allows designing a tighter optimism (Eq. 6 in Alg. 2), resulting in a \sqrt{HS} gain when bounding the term P'-P (i.e., term 3k). This idea is not novel and has been used several times in the literature for regret minimization; the oldest reference I can remember is (Lattimore and Hutter, 2012). As mentioned also in their paper, the problem is that no algorithmic solution is known when this additional constraint is considered. This is one of the limitations of the proposed algorithm (as you also mention). Moreover, I'm not sure that the optimization problem you would like to solve is well posed. Fruit et al. showed that the REGAL.D problem is not well posed in general (and it seems to me you are considering the same problem). The fact that you mention trying the relaxation proposed by Fruit et al. shows that you are aware of this issue, but there is no explicit discussion in the paper. Is the problem well defined? In line 4 you consider deterministic policies. In light of what was shown by Fruit et al., are you sure that it is enough to consider deterministic policies?

The second important part in proving the improved regret bound is to carefully bound the term (\hat{P}-P)(h_k-h^*). It is not clear how this result is obtained. The key component for this is Lemma 1, but I failed to understand how this result for flat MDPs is used in the proof of Thm. 1.
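For concreteness, the span-constrained optimistic planning problem at issue here is, schematically (REGAL-style notation, which may differ from the paper's):

```latex
% Optimism over the confidence set, with a constraint on the span of the bias:
(\tilde{M}_k, \tilde{\pi}_k) \in
  \operatorname*{arg\,max}_{M' \in \mathcal{M}_k,\ \pi} \ \rho(M', \pi)
\qquad \text{s.t.} \qquad
  \operatorname{sp}\!\big( h(M', \pi) \big) \le H,
```

where \mathcal{M}_k is the set of plausible MDPs at episode k and \operatorname{sp}(h) = \max_s h(s) - \min_s h(s). Dropping the constraint recovers the extended-MDP problem that extended value iteration solves efficiently; with the span constraint, no efficient planner is known.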
In general, the proofs are difficult to follow, since they are not supported by informal arguments and high-level intuitions are missing. For example, Ortner (2008) provided an example showing that h_k may not converge to h^*; here it is not clear why convergence can be obtained in general. The only reason I can think of is when a minimum number of samples (polynomial in the MDP parameters) is available in each state-action pair; as far as I understood, this is somehow the idea behind estimating the diameter. However, there is no discussion or explanation of this in the paper. To conclude, I cannot guarantee that Thm. 1 is correct, since I do not understand the important steps of the proof.

Note also that the best known bound for UCRL2 with Bernstein is \sqrt{D \Gamma S A T}, proved by Fruit et al. (2019) as part of the tutorial at ALT 2019; your approach is thus improving by a factor \sqrt{\Gamma}. Moreover, you should mention the work of (Talebi and Maillard, 2018), which provides a problem-dependent upper bound using the environmental norm. I'm familiar with this literature (I've been working on similar approaches), and the paper is very cryptic to me. The authors mainly focus on the final results rather than providing intuitions about the techniques (e.g., the mathematical tools) necessary to achieve them. Finally, even if the contribution is technically correct (the problem is well posed and the proof is correct), this does not close the question of designing optimal algorithms, since EBF is not implementable.

Minor comments:
- Lines 89-91: there is a reference to Eq. 2 that should be Eq. 1.
- Definition 4 is not intuitive. You should provide a high-level idea (something more than just calling it a count of arrivals).
- Line 137: you introduce the single-step regret reg_{s,a}. This term is not common in the standard analysis, but it can be expressed as the negative advantage function (reg_{s,a} = -A(s,a)), right? This would provide a direct interpretation for the reader.
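If I read the definition correctly, the identity behind this last remark is the standard optimality-gap decomposition for average-reward MDPs (notation mine):

```latex
% Single-step regret at (s,a) = Burnetas–Katehakis optimality gap = minus the advantage:
\mathrm{reg}_{s,a}
  \;=\; \rho^* + h^*(s) - r(s,a) - \sum_{s'} P(s' \mid s, a)\, h^*(s')
  \;=\; -A(s,a),
```

which is zero exactly for actions taken by an optimal policy and strictly positive otherwise.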
Lines 150-153: you should explain where the modified version of (2) comes from. In general, it is not easy to understand the utility of the flat MDP for the analysis of EBF; I didn't fully understand it even after reading the appendix.

References:
- Lattimore and Hutter, PAC Bounds for Discounted MDPs, 2012
- Ortner, Online Regret Bounds for Markov Decision Processes with Deterministic Transitions, ALT 2008
- Fruit, Pirotta and Lazaric, Improved Analysis of UCRL2B, Technical Report, Tutorial at ALT 2019 [https://rlgammazero.github.io/]
- Talebi and Maillard, Variance-aware regret bounds for undiscounted reinforcement learning in MDPs, ALT 2018

After feedback: thanks for the valuable feedback. It is now clearer which term you aim to bound (it is not |h^* - h_k|), but the intuition/tools that enable you to save the \sqrt{\Gamma} factor are still not clear. While I think that closing the gap between the upper and lower bound is an important problem, it is also important to understand the tools/ideas enabling this improvement. What you call reg_{s,a} is known in RL as the optimality gap (i.e., the advantage function); see Burnetas and Katehakis, Optimal adaptive policies for Markov decision processes, 1997. If I understood correctly, the idea is to use the optimality gap to make the MDP "flat". Can you explain better how this is used to design the algorithm? Can you provide a formal statement that the optimization is well posed? Notes about the mistake in optimistic PSRL can also be found here: Qian et al., Concentration Inequalities for Multinoulli Random Variables, Technical Report, Tutorial at ALT 2019 [https://rlgammazero.github.io/docs/JFPL2018notesPSRL.pdf]

### Reviewer 2

Thanks to the authors for the feedback. After a long discussion with the other reviewers, I maintain my positive review. But I agree with Reviewer 1 that the authors should put a significant amount of effort into improving the presentation of the main results. Specifically, a crystal-clear high-level roadmap for the results should be added.

------------------------------------------------- Original Comments -----------------------------------

It appears to me that the contribution of the work is pretty solid. One (possibly minor) technical question: is the algorithm EBF computationally efficient? Is it possible to leverage an EVI-like technique to efficiently execute this step?

The quality of writing needs to be improved. For example:
1. Definition 4 is not sound; it is required that all te_k and ts_k be defined w.r.t. L.
2. In line 142, why would the sum of reg be O(\sqrt{N})?
Many more issues can be found throughout the paper. There are also some misleading parts in the related-work section; for example, it has been pointed out in many places that [Bartlett and Tewari 09] contains some mistakes.
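For reference, the EVI-like step the question alludes to is the inner maximization of extended value iteration (Jaksch et al., 2010): pick the most optimistic transition vector within an L1 ball around the empirical estimate. A minimal sketch (function name and interface are mine, not from the paper under review):

```python
import numpy as np

def optimistic_transition(p_hat, u, beta):
    """Inner maximization of extended value iteration:
    maximize p . u over distributions p with ||p - p_hat||_1 <= beta."""
    order = np.argsort(-u)                 # states sorted by decreasing value u
    p = p_hat.astype(float).copy()
    best = order[0]
    # shift as much mass as allowed onto the highest-value state
    p[best] = min(1.0, p_hat[best] + beta / 2.0)
    # remove the resulting excess mass from the lowest-value states first
    for s in order[::-1]:
        excess = p.sum() - 1.0
        if excess <= 0:
            break
        p[s] = max(0.0, p[s] - excess)
    return p
```

This runs in O(S log S) per state-action pair, which is why unconstrained optimism is tractable; it is the additional span constraint on the bias that breaks this scheme and makes EBF's optimization step hard to implement.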

### Reviewer 3

I do understand that it is important to demonstrate the existence of algorithms achieving the regret lower bound, even if they require some non-trivial cost or assumption. However, my major concern is the intractable complexity of the proposed algorithms, and thus the absence of an evaluation in canonical cases. As the lower bound is defined in a minimax sense, achieving it does not guarantee efficiency in canonical cases with a moderate sub-optimality gap. Hence, it would be helpful if a performance guarantee in terms of the sub-optimality gap were provided.

Minor comments/questions:
- In my understanding, Jaksch et al.'s lower bound is built upon no prior knowledge of the bias function. Is it possible to obtain a similar lower bound under the assumption that the upper-bound analysis requires?
- According to [Agarwal and Jia 17], Lemma C.1 is an extension of results in [Osband and Van Roy 16]. Does that mean the result of [Osband and Van Roy 16] is untrue?
- Is it possible to translate the upper bounds to finite-horizon MDPs?
- Line 38: impossible
- Line 89: wrong reference?
- Have you contacted the authors of [Agarwal and Jia 17] about the errors in their lemmas? I'm not very sure if someone else's mistake should be pointed out in this way.
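For context on the gap-dependent guarantee asked for above, such bounds already exist for optimistic algorithms; if I recall Jaksch et al. (2010, Thm. 4) correctly, with g the gap in average reward between the best and the second-best policy, UCRL2 satisfies

```latex
% Logarithmic, gap-dependent regret of UCRL2 (Jaksch et al., 2010, Thm. 4):
\mathbb{E}\big[\mathrm{Regret}(T)\big] \;=\; O\!\left( \frac{D^2 S^2 A \log T}{g} \right),
```

so a natural question is whether EBF admits an analogue with improved dependence on D and S.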