Reviews: Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

************ [updated after the rebuttal] I am quite positive about this paper, robust approaches are important in practice. I understand that for heuristic approaches like this, proving theoretical guarantees is hard. The authors at least give some intuition about its asymptotics in the rebuttal, which should be easily formalized (and I would request it to be added to the paper if accepted). The authors did a very good job argumenting for their approach and evaluating its performance numerically, on realistic examples. However, NeurIPS might not be the best venue for such work. It only covers evaluation of a policy. Perhaps a follow-up paper which covers incorporating this approach into optimization (policy improvement), which the authors say is left for future research, would be more suitable. If the paper is rejected from NeurIPS, I would like to encourage the authors to add some (even if weak or asymptotical) theoretical guarantees and submit it elsewhere. Also some of the approaches mentioned in the books by Warren Powell (on Approximate Dynamic Programming and on Optimal Learning) might help the authors to refine or support their approach, or serve as additional benchmarks. ************ Originality: I can't judge since I am not familiar with work in this area. Quality: The proposed algorithm is designed based on in-depth understanding of TD and MC approaches and on their strengths and weaknesses. Clarity: very good, easy to read, no typos. Significance: A potentionally useful algorithm as shown on a wide range of problem instances, but its significance hasn't been evaluated by benchmarking with other methods. I have not identified ay minor comments such as unclear sentences or typos.

Summary. The authors propose a novel method for adaptively using either the MC method for policy evaluation or the temporal difference method. The authors aim to solve the problem of balancing bias and variance in the reinforcement learning setting and to this end propose the Adaptive TD algorithm. The algorithm takes as input a set of sample episodes which it uses to bootstrap some confidence intervals for the value function of each state. It then compares the TD estimate for each of these states with these confidence intervals and keeps the TD estimate if it fits inside, otherwise, it picks the middle of the confidence interval as it assumes the TD estimate is essentially biased and inaccurate. The process repeats for a number of epochs (since the TD estimates change as the value function estimate for the future state is updated by the adaptive-TD rule). I think this paper shows promise: the method is, to my knowledge, original and from the numerical experiments seems to achieve the target the authors set for it - dominating TD and MC in the worst case. I think there is room to build on top of this method as well. However, overall, I found the presentation of the paper to be insufficiently clear, with key points not being sufficiently addressed (see below) and the method to have limited practical applicability. I therefore vote to reject this paper in its current form. Detailed comments: It is hard to get excited about this paper. I like the idea of using confidence intervals on the MC estimate to figure out whether or not the TD estimate is reliable. To my knowledge, this is original and makes a lot of sense. One major concern I have with this paper is the fact that the computational complexity increases with time. I think this severely limits the practical relevance of the method. The presentation, in my opinion is poor and the method and its relevance are very hard to understand from the paper. I found the choice of putting the pseudocode of the algorithm in the Appendix to be particularly uninspired. I think the setting is not presented sufficiently clearly: what is the practical problem that is solved here? Is this new way of estimating value functions supposed to be used after every time step n of a RL algorithm or is it a way of approximating the optimal policy after training is complete? It seems that running this algorithm throughout the execution of a RL agent is prohibitively expensive (several epochs over all observations so far). I have had a hard time finding the answers to the following questions: 1) The confidence intervals are computed based on inference of m neural networks trained on the MC target. In my opinion, the difficulties of this training step are not discussed sufficiently. If the MC target is not reliable enough to be used why would these networks' inferred estimates be more reliable? Another insufficiently discussed aspect is the fact that the confidence intervals assume these estimates are i.i.d. gaussians (which is a fairly bold assumption in this case). How much is the algorithm affected by this assumption - can we assume whatever distribution here without degrading numerical performance? Using a GP (as the authors indicate would be remedy) or other such techniques might offer the comfort of reliable confidence intervals, but further increase the already large (in my opinion) computation cost. Putting a confidence interval on the output of a neural network trained at runtime is far from a trivial task, which brings us to point 2). 2) In the experiments, m = 3. This would lead to very large confidence intervals. How are m, \alpha and the confidence threshold chosen? If for example we pick m to be large, this would create very small confidence intervals which would reduce the algorithm to always using MC estimates (since the TD estimate is always outside)? This seems like an arbitrary way of getting confidence intervals (these intervals should depend on the number of sampled states n, not m - handpicked in the algorithm design phase) that heavily relies on the parameter choice. 3) What are the drawbacks of this approach? I do not think the authors address these sufficiently. In the experiments on Lab2D - Figure 2, it seems MC ensemble(3) is not done learning. Would it outperform Adaptive TD(3) given more time? What is the relative runtime between the algorithms? In Figure 3, Adaptive TD(3) behaves very poorly on 2 of the 6 maps. This shows that its performance is not reliable and highly problem dependent - can these problems be identified in advance? Figure 4 does however show better performance. The figures in the appendix also show the algorithm achieves the goal set by the authors: dominate both TD and MC in the worst case over a large number of settings. I find Figure 1 to be uninformative and could be sacrificed to include the algorithm pseudocode in the main body of the paper. ================ Post-Rebuttal ================ I am happy with the clarifications made by the authors. I now have a better understanding regarding what exactly are the effects of increasing $m$ on the confidence intervals on the MC estimate. I still find it odd that numbers this small (like m=3) provide usable confidence intervals. Regarding the "computational complexity increasing with time", I was assuming the algorithm is used in a reinforcement learning loop. In consequence, I updated my score to 6.

Paper ID:	6386
Title:	Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates

Reviewer 1

Reviewer 2

Reviewer 3