Reviews: Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

The submission studies the adversarial online learning in episodic loop-free Markov decision processes. The importance of this work is that it is the first to provide the understanding to an adversarial online learning problem where the transition function is unknown, the loss functions are changing, and each feedback is bandit. The related work clearly describe the line of this research field from fixing an unknown transition and an unknown loss function to the setting studied in this submission. Although the MDPs considered in the submission is L-layered and loop-free, the results and the analysis pave the way for general MDPs. The main idea is the design of the confidence sets to include the optimal occupancy measure which induces the optimal policy. Two ways to construct the confidence sets are proposed, one with the reachable probability \beta assumption and the other without. The confidence sets are also responsible for producing policies with sufficient degree of exploration. The other idea is the construction of estimated loss functions from bandit feedback, so that an optimistic occupancy measure balancing the trade-off between the loss and the distance to the previous chosen occupancy measure is computable. Two algorithms are proposed and analyzed. When the MDP satisfies the \beta > 0 assumption, Bounded Bandit UC-O-REPS achieves an O ̃(L|X|\sqrt{|A|T}/^beta) regret. When the \beta > 0 assumption is removed, Shitted Bandit UC-O-REPS achieves a regret bound of O ̃(L^{3/2}|X||A|^{¼} T^{ ¾} ). Originality. The submission is the first one to tackle the two challenges, unknown transition and bandit feedback, at the same time. Quality. Rigorous and clear-to-follow proofs of the analysis are provided in the supplementary. Clarity. The paper and the analysis are self content and well written. Significance. The submission provides insights and solutions to the bandit online MDP learning. Although the problem instances are restricted to episodic loop-free MDPs, the submission pave the way for general MDPs. ------------------------------ Update Thank you for the response. The feedback from the authors is appreciated has been reflected in the overall score for the submission.

Reviewer 2

After reading the response, I believe the authors could make the problem formulation and algorithms more clear. I am raising my score accordingly. ================================================== Overall, I feel this paper is not self-contained and incomplete. There is no complete algorithm being presented. Some of the notations are not well-defined and somewhat confusing. The authors also spend much effort explaining the proofs while the problem formulations and algorithms are not clear. I believe the priorities should be reversed. My detailed comments are as follows: 1. Section 4 on page 4: This seems to be a partial/incomplete introduction to UCRL type algorithms. Many of the key components in the UCRL algorithm which are likely needed in this paper (such as the empirical transition counts, the notion of epochs, etc.) are not explained. It is therefore not clear what is the purpose of this section. In particular, (1) the notation $\overline{P}_t$ seems being only verbally defined as the ``empirical transition function''. Its precise mathematical meaning is unclear. (2) The elements in $\Delta(M,t)$ seem to be probability measures q(x,a,x'). How does it also contain conditional probability measures P(x'|x,a)? It might better to define explicitly what \Delta(M,t) is. (3) What is $\epsilon_t(s,a)$? It is not defined anywhere in the paper. What is the role played by this constant and how does it affects the performance of the algorithm? 2. Section 5.1: This section is not self-contained and confusing as there is no algorithm presented. Instead, the authors simply refer to a previous paper [4] and mention a few things that need to be changed there. I believe this is not a good way of presenting algorithms as the readers have to look back into [4] in detail and connect dots and pieces in order to understand what the algorithm does. 3. Section 5.2: Since UC-O-REPS is not clearly explained and the empirical transitions are not defined. It is rather difficult to follow this section, and it is not clear what is the motivation for defining a new transition $P_t^*$. 4. The proofs seem to follow the proof for UC-O-REPS [4] rather directly. However, many of the key components in UC-O-REPS are not explained. For example, this proof does not explain why the algorithm requires the notion of epochs here or why the empirical transition matrix should be updated per epoch in the way suggested by [4]. Other comments: 1. Line 130: Change ``different than'' to ``different from''. 2. First equation after Line 130: A summation is missing in the computation of expectation. 3. Line 132: ``For this purpose we constraint the confidence...'': Change ``constraint'' to ``constrain''. 4. Line 134: It is not clear what the elements in $\Delta_\alpha(M,t)$ are. The elements in $\Delta(M,t)$ seem to be q(x,a,x'), which are incompatible with q(x). 5. If $\Delta_\alpha(M; t) = \emptyset$, then, [$q_{t+1}$ is chosen to be an arbitrary occupancy measure].

Reviewer 3

1. Originality and Quality. This is work is original to the best of my knowledge. Adequate citation is given to related work. 2. Clarity. The paper is very well written and straight-forward. Adequate intuition and discussion are provided in the paper. The writing is reader-friendly as proofs for important lemmas are included in the main paper. 3. Significance. The idea of the first algorithm is straight-forward but the proof needs slight modification due to the bias. The idea of the second algorithm is not very trivial. ========= Update ========= Thank you for the feedback. My overall score remains the same.

Paper ID:	1308
Title:	Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function

Reviewer 1

Reviewer 2

Reviewer 3