Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
After rebuttal: Thanks to the authors for addressing my concerns. I have read the authors' feedback and the other reviews. I will keep my original score.

------------------------

This work proposes an E3-style model-based algorithm for finite-horizon MDPs with known reward. The agent collects data while running the algorithm, and the goal is to find a near-optimal policy. Early work includes , , and . The use of ranks of error matrices to represent complexity and some of the proof techniques are related to  and . The paper is technically sound and clearly written. On the theoretical side, the authors prove a polynomial sample complexity bound in terms of |A|, H, and the rank of the model misfit matrix, thus avoiding any dependence on |S|. This work is also supported by experiments: the empirical results show that the Neural E3 algorithm achieves performance comparable to standard algorithms.

Some comments:

1. In general, I like the idea of this work. The algorithm exploits the uncertainty over predicted states instead of directly maximizing the expected return. In the exploration step, the algorithm finds a policy that induces a mismatch between two models, so at least one of the two models is not the true model. The algorithm then collects enough trajectories to eliminate the inconsistent model.

2. This paper assumes that the true reward R* is known, which seems too strong. When the true reward is known, predicting future states subsumes predicting future rewards. My major concern is that the idea of comparing the probability distributions induced by different models only works under this assumption. I wonder whether this work can be extended to the case where R* is unknown.

3. The sample complexity result is important. However, the analysis might not be novel, since parts of it are closely related to and adapted from .

4. The experiments further strengthen this work. The closely related works  and  do not provide practical instantiations.
In some simple tasks, the proposed Neural E3 algorithm achieves results similar to the baselines. However, Neural E3 works only in deterministic environments, so there is still a gap between the theory and the experiments. Is it possible to extend Neural E3 to stochastic environments?
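To make the elimination idea in comment 1 concrete in the tabular case, here is a minimal sketch (all names are hypothetical, not the authors' code): find where two surviving candidate models disagree most, sample the true transition there, and drop the candidate that fits the observed samples worse.

```python
import numpy as np

def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def eliminate_step(models, sample_next_state, state, n_samples=100):
    """One elimination round over a list of candidate transition models.

    models: list of dicts mapping state -> next-state distribution (np.array)
    sample_next_state: callable simulating the true environment at `state`
    """
    if len(models) < 2:
        return models
    # Pick the pair of surviving models that disagree most at `state`.
    i, j = max(
        ((a, b) for a in range(len(models)) for b in range(a + 1, len(models))),
        key=lambda ab: total_variation(models[ab[0]][state], models[ab[1]][state]),
    )
    # Build the empirical next-state distribution from collected samples.
    counts = np.zeros(len(models[i][state]))
    for _ in range(n_samples):
        counts[sample_next_state(state)] += 1
    empirical = counts / n_samples
    # At least one of the two disagreeing models is wrong; drop the one
    # further (in TV distance) from the empirical distribution.
    tv_i = total_variation(models[i][state], empirical)
    tv_j = total_variation(models[j][state], empirical)
    worse = i if tv_i > tv_j else j
    return [m for k, m in enumerate(models) if k != worse]
```

This is only the elimination step; the paper's algorithm additionally plans a policy that steers the agent to such disagreement states.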
UPDATE after rebuttal
======
Thanks for the author response. I have read the response and the other reviews as well. In the response, the authors explained the discrepancy between the idealized and the practical versions of the algorithm. I marginally lean toward accepting this work.

Quality
======
The theoretical analysis is sound for the problem and the proposed algorithm. Under the realizability and optimal-planning assumptions, the authors prove that Algorithm 1 and Algorithm 2 find a near-optimal policy with polynomial sample complexity in factored MDPs. The experiments show that the proposed practical algorithm (though it may not exactly match the idealized algorithm that is analyzed) can outperform previous methods.

Clarity
======
The organization of the lemmas and theorems is good, and the statements are clear, precise, and relevant. However, the model-elimination part of the practical method is not clearly described. The authors explain how model accuracy is updated in practice, but the model-elimination stage seems to be missing from the practical algorithm (a model update as in Sec. 4.1 instead of model elimination?), which makes it unclear how the theoretical analysis guides the practical implementation.

Originality
======
The paper extends the classical Explicit Explore or Exploit (E^3) algorithm to large or infinite state spaces. The use of maximum disagreement as an uncertainty measure for exploration is novel. Although some prior work ('MAX' and [Deepak2019]) also uses disagreement among ensemble models for exploration, the disagreement measure in this paper is different. However, for large state spaces, the authors rely on the structural properties of factored MDPs, which are well studied. Also, the technical part of the proof seems to follow closely that of .

[Deepak2019] Deepak Pathak, Dhiraj Gandhi, Abhinav Gupta. Self-Supervised Exploration via Disagreement. In ICML 2019.
Significance
======
The theoretical result is strong enough to show that maximum disagreement, used as an uncertainty measure, is sample-efficient for exploration and can lead to near-optimal policies in model-based RL. However, the bound does not seem as sharp as the sample complexity of other model-based algorithms, e.g., UCRL. On the other hand, the empirical results show that the proposed method has superior performance to previous ones. Still, my main concern is that there is a gap between the theoretical algorithm and the practical implementation (especially in continuous control), which should not be neglected.
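To illustrate the ensemble-disagreement idea discussed under Originality, here is a minimal sketch in the spirit of 'MAX' and [Deepak2019] (hypothetical names, not the paper's exact disagreement measure): the spread of an ensemble's next-state predictions serves as an uncertainty signal for exploration.

```python
import numpy as np

def disagreement_bonus(ensemble_predictions):
    """Exploration bonus from ensemble disagreement.

    ensemble_predictions: array of shape (K, state_dim) holding the K
    models' predicted next states for one (state, action) pair. Where
    the predictions spread out, the dynamics are uncertain, so the
    bonus (per-dimension variance across models, averaged) is large.
    """
    return float(ensemble_predictions.var(axis=0).mean())
```

The paper instead maximizes disagreement between candidate models directly, rather than adding a bonus to the reward, which is the key difference the review notes.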
This paper is well presented, and it also provides a theoretical justification of its sample efficiency. Main questions:
- How did you set the parameter \phi in Alg. 2, which was not present in line 8 of Alg. 1?
- From Theorem 1, the sample complexity depends on |A|^4. Is it possible to improve the algorithm so that it can also handle large/infinite action spaces?