__ Summary and Contributions__: In the context of preference-based RL, this paper identifies an assumption on the answers to preference queries and introduces an algorithm which, under that assumption, recovers near-optimal policies with respect to the underlying reward function generating the comparison feedback.

__ Strengths__: The paper provides insight into preference-based RL. It shows that a winning policy might not actually exist even if people gave perfect trajectory comparisons, which is in itself an important point. It also shows that, under the assumption that the probability of an answer is a linear function of the difference in cumulative reward, an algorithm exists that yields epsilon-optimal policies, and it studies the sample complexity in terms of both the comparisons required and the exploration steps. Concretely, the results establish that the exploration complexity of learning from comparisons can be similar to that of value-based RL in the simulator setup and slightly worse in the exploration setup.

__ Weaknesses__: There are two main weaknesses.
First, I’m not sure whether the core contribution is meant to be the algorithm or the analysis. If it’s the algorithm, then the paper needs to actually test it in more than toy settings (and ideally with real humans, rather than simulating answers with BTL under two parameter settings). But if it’s the analysis, I almost feel like the experiments are distracting, or at least overstated and drawing attention away from the main contributions. I’d love to hear the authors’ perspective on this, but my suggestion would be to either a) get the best of both worlds by running a more serious experiment, or b) edit the paper to highlight the analysis and justify the experiments as showing what the algorithm does empirically, perhaps with some qualitative analysis of the resulting behavior on simple tasks to aid understanding of the algorithm.
Second, the assumption this is based on is very strong. It is pretty clear that real people won’t satisfy it, as decades of behavioral economics research have shown us. Two things would make the analysis much stronger: expanding it to a family of models, and analyzing how, when the assumption does not hold, the approximation error propagates into the bounds. For the former, allowing a general link function f(difference) instead of a linear one, including the BTL model, would be nice.
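To make the suggestion concrete, here is a minimal sketch of my own (not from the paper; the constants c and beta are illustrative) contrasting a linear link of the Assumption-1 type with a BTL-style logistic link:

```python
import math

def linear_link(dv, c=0.25):
    # Assumption-1-style link: P(traj 1 preferred) is affine in the value gap dv
    return 0.5 + c * dv

def btl_link(dv, beta=1.0):
    # Bradley-Terry-Luce link: logistic in the value gap dv
    return 1.0 / (1.0 + math.exp(-beta * dv))

# Near dv = 0 the two links agree to first order (the sigmoid's slope
# at 0 is 1/4), but for large gaps the linear link can even leave
# [0, 1] while the BTL link saturates.
for dv in (0.0, 0.5, 1.0, 3.0):
    print(dv, linear_link(dv), round(btl_link(dv), 4))
```

This is just to illustrate that the two models coincide only for small value gaps, which is where an analysis covering a family f(difference) would matter.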
Further potential weaknesses:
- Prop 1 shows that a Condorcet winner amongst policies might not exist, due to cyclic preferences induced by trajectory preferences. However, recent work [1] showed that a von Neumann winner always exists. Using this as motivation for switching to the strong Assumption 1 only makes sense if one can show that the von Neumann winner (which is now a stochastic policy) would have poor performance with respect to the value function in the underlying MDP.
[1] Dudík M., Hofmann K., Schapire R. E., Slivkins A., Zoghi M. Contextual dueling bandits. arXiv preprint arXiv:1502.06362, 2015.
- Assumption on reward scaling: Lines 87-88 mention that both the rewards and the cumulative value over H states are bounded in [0,1]. If the rewards are in [0,1], shouldn't the correct scaling of the values be [0,H]? How does this affect the scaling of the sample complexity bounds?
- Assumption on state space: in Line 81, the paper assumes that the state space can be partitioned into H sets. This implies that if the same set of states can be encountered at each time step, the effective size of the state space is S_eff = S H. I believe that the bounds in this case would incur additional H factors, making them sub-optimal? Is this assumption necessary? Is it also assumed in the prior work and in the bounds compared against for the value-based RL setup? If not, I believe that the comparisons with the existing bounds might be unfair (same for point d above).
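As a concrete illustration of the cyclic-preference phenomenon behind Prop 1, the classic nontransitive dice give a pairwise "better-than" relation with no overall winner. This is a standalone sketch of my own, not the paper's construction:

```python
from itertools import product

# Classic nontransitive dice: each die beats the next with probability
# 5/9, in a cycle A > B > C > A -- a pairwise preference relation with
# no Condorcet winner, analogous to cyclic trajectory preferences.
A = (2, 2, 4, 4, 9, 9)
B = (1, 1, 6, 6, 8, 8)
C = (3, 3, 5, 5, 7, 7)

def win_prob(x, y):
    """P(die x rolls strictly higher than die y), by exhaustive enumeration."""
    wins = sum(1 for a, b in product(x, y) if a > b)
    return wins / (len(x) * len(y))

print(win_prob(A, B), win_prob(B, C), win_prob(C, A))  # each 5/9
```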
More minor points on experiments:
- For the synthetic experiments, the comparisons are generated according to a BTL model using the underlying reward function. How robust are the findings if we change this model?
- Even for the BTL model, it would be good to have a plot with varying c for the proposed algorithm, to understand the robustness of the procedure to noise in the comparisons.
- The experiments seem quite small-scale, with a synthetic grid-world (4x4) and a random MDP (5 steps, 4 states per step). How well does the overall method do in large-scale setups? It would be good if the procedure could also be run on, for instance, mountain car (and other such environments), as was done in Novoseller et al. 2019.
- Scaling plots: In order to understand if the scaling with the terms S, A and H are indeed tight, it would be good to have experiments which study these scaling laws.
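To illustrate the kind of robustness study suggested above, here is a minimal simulation of my own (the value gap dv, the trial counts, and the majority-vote rule are illustrative choices, not the paper's algorithm) showing how the link slope c governs how informative noisy comparisons are:

```python
import random

def simulate_majority_accuracy(c, dv=0.5, n_comparisons=30, n_trials=2000, seed=0):
    """Fraction of trials in which a majority vote over noisy comparisons
    (linear link: P(correct answer) = 1/2 + c * dv) identifies the
    higher-value trajectory."""
    rng = random.Random(seed)
    p_correct = 0.5 + c * dv
    wins = 0
    for _ in range(n_trials):
        votes = sum(1 for _ in range(n_comparisons) if rng.random() < p_correct)
        if votes > n_comparisons / 2:
            wins += 1
    return wins / n_trials

# Smaller c means noisier comparisons, so more of them are needed
# for the same reliability.
for c in (0.05, 0.2, 0.4):
    print(c, simulate_majority_accuracy(c))
```

A plot of such curves over c would give a direct empirical picture of the procedure's sensitivity to comparison noise.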

__ Correctness__: At a first glance, the proofs in the appendix look fine to me and the results stated in the main paper seem correct under the assumptions listed.

__ Clarity__: Overall, the paper is well motivated and well-written. The theorem statements are precise and the paper does a good job of comparing the different results in the form of corollaries and discussions.
Small nitpick: it would be useful, after Assumption 1, to write one sentence noting that a Condorcet winner now exists, because preferences are based on value and a value-maximizing policy exists. This would just help the exposition; of course it is implicit right now.
There are a couple of minor typos which need to be addressed:
a. In the MDP formulation, it looks like both the MDP and the policies are deterministic, yet the paper mentions randomness in the MDP (see Prop 1 and thereafter). I believe the transition model needs to encode random transitions.
b. Definition 1: C(\eps, \delta) -> \Psi(\eps, \delta)
c. Proposition 2: C -> C'
d. Corollary 4: Should be log(S/\delta)
e. Proof of prop 1: should be pi0, pi1, pi2 (rather than pi1, pi2, pi3)

__ Relation to Prior Work__: The paper positions itself with respect to relevant prior work in the first couple of pages and also compares the established results with previously known results in both preference-based and value-based reinforcement learning. For completeness, it would be nice if some insights into the dueling bandits framework could be provided in the appendix for readers not familiar with the topic.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: This work contributes to the preference-based reinforcement learning (PbRL) problem. In contrast to classical RL, where numerical rewards are observed, in PbRL only comparisons between trajectories are observed. Intuitions about this problem are given by discussing what happens under different assumptions in Propositions 1 and 2. Then, algorithms using RL and dueling bandit algorithms as subroutines are described and analysed in terms of both the step and comparison complexities required to find the optimal policy in a PAC setting. Finally, numerical experiments are provided as illustrations of the theory.

__ Strengths__: This paper provides fundamental contributions to the PbRL problem, which is far less well studied and understood than the standard RL problem, in spite of its potentially broad application in contexts where numerical rewards are missing.

__ Weaknesses__: Some assumptions may be too constraining for real world applications.

__ Correctness__: The method is correct. The empirical methodology corresponds to the theoretical setting.

__ Clarity__: The paper is very well written.

__ Relation to Prior Work__: Yes, related PbRL works are recalled, as well as dueling bandits algorithms that are used as a subroutine.

__ Reproducibility__: Yes

__ Additional Feedback__: Some typos:
- line 76: two sentences should be permuted
- line 79: did you mean "random" instead of "randomized"?
- line 110: C(eps, delta) should be replaced by Psi(eps, delta)
- line 112: please specify "for any arm a different from hat{a}"
- lines 137 and 139: I think that s_h should be replaced by s
- lines 157-158: choose either C or C'

__ Summary and Contributions__: The paper looks at preference-based reinforcement learning from a sample complexity standpoint. In preference-based reinforcement learning, feedback is provided on pairs of trajectories to say which is better/preferred. Thus, the feedback is weaker than traditional RL in two respects. First, comparisons are qualitative (x > y) and not quantitative (x is worth R(x)). Second, information about the utility of individual steps in the trajectory is lost and comparisons are made in the aggregate.

__ Strengths__: The analysis seems thorough and novel.

__ Weaknesses__: This paper does not make an explicit case for the utility of the preference-based framework. The framework seems interesting and arguably understudied, but I think more of an argument should be made for its significance.
Alternatively, I'd find it interesting to learn more about the technical details of developing algorithms for this class of problems. Instead, a lot of the technical meat seems to be hidden in the behavior of dueling bandit algorithms.

__ Correctness__: The analysis seems appropriate and correct.

__ Clarity__: Paper is mostly clear, but I would have felt better informed to hear more about the internals of preference-based RL algorithms / dueling bandits. To me, the paper read as if it were an addendum to a prior paper and not really a standalone presentation.

__ Relation to Prior Work__: Yes, relation to both earlier preference-based RL work and sampling complexity analyses of the classical RL setting is addressed.

__ Reproducibility__: No

__ Additional Feedback__: Post author response:
I thank the authors for their response. I had been uninformed about preference-based RL, and so I began to read the survey paper they shared. I still have misgivings about the paper, but I acknowledge that the other reviewers (as well as the area chair) have expertise more directly relevant to the paper, and I am comfortable giving their reviews/recommendations more weight.
Detailed comments:
"Although lower bounds for these bandit problems extends to PbRL," -> "Although lower bounds for these bandit problems extend to PbRL,"?
"For simplicity we assume S >=H": "S" hasn't been defined yet. Maybe move the sentence over by one?
Proposition 1 is interesting. I have played with these kinds of non-transitive distributions. I suspect some readers may be unfamiliar with the phenomenon. Maybe include a citation? Even a wikipedia reference might be useful (https://en.wikipedia.org/wiki/Nontransitive_dice).
"a MDP" -> "an MDP"?
"In this Section," -> "In this section,"
"Preferece-based Exploration" -> "Preference-based Exploration"
"reward-free learning literatures" -> "reward-free learning projects"?
"we compare them and feed it back": What is "it" referring to here?
"PEPS almost always get" -> "PEPS almost always gets"

__ Summary and Contributions__: This paper presents a finite-time analysis of preference-based reinforcement learning. Overall, it fills a gap in the preference-based RL literature. Moreover, a new algorithm with a theoretical guarantee is proposed.

__ Strengths__: The theoretical analysis is a significant contribution.

__ Weaknesses__: The assumption might be overly strong and unrealistic.
There are some errors in the proof, which need to be rechecked.

__ Correctness__: There are some issues in the proofs (line numbers refer to the full paper file):
- line 493, second inequality: s_h \in \tilde S should be s_h \notin \tilde S
- line 508: the same issue
- line 511: = H should be \leq H
Please recheck the proofs.

__ Clarity__: The paper is easy to follow.

__ Relation to Prior Work__: The difference is clear.

__ Reproducibility__: Yes

__ Additional Feedback__: An overall major concern with this paper is whether Assumption 1 is overly strong. Note that when a practitioner specifies a reward function, they are commonly not fully aware of the total reward, i.e., the V value. This happens very often, as practitioners need to adjust their reward functions to obtain good performance. What Assumption 1 says is that, when the preference is given, it will not lose good policies. If a practitioner were able to do that, they would likely be able to give a good reward function directly. So can the assumption be relaxed to a probabilistic inequality, e.g., φ_s(\pi_1,\pi_2) ≥ C(v_{\pi_1}, v_{\pi_2}) with some probability? How would this change affect the analysis?
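As a small sketch of one possible relaxation (my own uniform-flip corruption model, not something the paper considers): if each answer follows the linear link with probability 1 - p and is flipped otherwise, the link stays linear with a shrunken slope, which suggests the analysis might carry through with an effective constant c' = c(1 - 2p):

```python
def linear_link(dv, c=0.25):
    # Assumption-1-style link: P(prefer pi_1) = 1/2 + c * (V(pi_1) - V(pi_2))
    return 0.5 + c * dv

def corrupted_link(dv, c=0.25, p=0.1):
    # With probability (1 - p) the answer follows the linear model;
    # with probability p it is flipped.
    phi = linear_link(dv, c)
    return (1 - p) * phi + p * (1 - phi)

# Algebraically: corrupted_link(dv) = 1/2 + c*(1 - 2p)*dv, i.e. the
# corruption only rescales the slope c -> c*(1 - 2p).
for dv in (-1.0, 0.0, 0.5, 1.0):
    assert abs(corrupted_link(dv) - (0.5 + 0.25 * (1 - 0.2) * dv)) < 1e-12
print("corruption only rescales c ->", 0.25 * (1 - 2 * 0.1))
```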