
Submitted by Assigned_Reviewer_1
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper presents a clear and readable comparison of classical lambda-returns (which lie at the heart of TD(lambda)), gamma-returns (which were introduced more recently), and Omega-returns (which are derived in the current paper). I think this perspective (which builds on the insights in the gamma-return work) is really interesting and instructive. I don't think the new algorithm that was presented is all that helpful, but the paper provides valuable insights on an important and complex core problem in learning.
Quality: very good Clarity: excellent Originality: good Significance: good
Line 068: Can L be infinity? If not, can you be more specific about how these results relate to standard applications with no fixed step length?
Line 078: I think L is missing from w_lambda(n).
Line 270: I think the figures would be easier to compare if the axes were fixed between plots. I think I'm supposed to be relating (a) to (d) and it's difficult to do so with the varying scales.
Line 282: So, it looks like many of the (f) plots show a pattern that involves a lot of weight on the one-step estimator (TD(0)) and sometimes a bit on TD(1). Does this work provide additional insight into how good an estimator TD(0) is? It seems like it's quite easy to compute and perhaps nearly as good as what is coming out of the complex estimator. Some insight into what is gained by the small weights on the other estimators would be illuminating.
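To make the role of L in the comments on lines 068 and 078 concrete, here is a minimal sketch of the standard truncated lambda-return weights. It assumes the usual convention that the final (Monte Carlo) return absorbs all remaining weight; whether this matches the paper's definition of w_lambda(n) is an assumption, and the function name is hypothetical.

```python
def lambda_return_weights(lam, L):
    """Weights on the n-step returns, n = 1..L, for a trajectory of length L.

    Assumed convention: w(n) = (1 - lam) * lam**(n - 1) for n < L, and the
    last return (n = L) receives the leftover mass lam**(L - 1).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if L < 1:
        raise ValueError("L must be at least 1")
    weights = [(1.0 - lam) * lam ** (n - 1) for n in range(1, L)]
    weights.append(lam ** (L - 1))  # remaining mass goes to the full return
    return weights


# Example: lam = 0.5 and a horizon of four steps.
example_weights = lambda_return_weights(0.5, 4)
```

Under this convention the weights sum to one for every finite L, and for lam < 1 the mass lam**(L - 1) on the last return shrinks to zero as L grows, which is one way to read the infinite-horizon case.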
Q2: Please summarize your review in 1-2 sentences
The paper presents a clear and readable comparison of classical lambda-returns (which lie at the heart of TD(lambda)), gamma-returns (which were introduced more recently), and Omega-returns (which are derived in the current paper). I think this perspective (which builds on the insights in the gamma-return work) is really interesting and instructive. I don't think the new algorithm that was presented is all that helpful, but the paper provides valuable insights on an important and complex core problem in learning.
Submitted by Assigned_Reviewer_2
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
***After author response/discussion***
I have moved my score up for two main reasons: 1) The authors' helpful response did clarify the experimental setup for me, which I agree does not need to include True TD(lambda). I have edited my review below to reflect this.
2) I have come around to thinking of the approximate algorithm as a proof of concept rather than a primary contribution. Seeing this paper as describing a solid foothold in this direction, and some early evidence that there may be something to be gained by going further, makes me less concerned about its practical impact and more optimistic about its chances of provoking questions and possibly inspiring follow-on work.
Paper Summary
This paper proposes a novel replacement for the lambda return for mixing between bootstrapping and Monte Carlo estimates in reinforcement learning. Specifically, the Omega return in principle accounts for the covariance of returns at different depths (whereas the lambda return implicitly assumes the returns at various depths are independent). Computation of the Omega return is formulated as a generalized least squares problem, which impractically relies upon unavailable quantities. The paper describes one method for approximating these quantities using experience data. Empirical results in four domains suggest that the approximation produces qualitatively sensible weights, and that the use of the approximate Omega return yields more accurate off-policy estimates of policy quality.
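As an illustration of the generalized least squares formulation summarized above, the following is a minimal sketch, not the paper's implementation. It assumes every return is an unbiased estimate of the same value and that the covariance matrix Omega is known and invertible, in which case the minimum-variance linear unbiased weights are Omega^{-1} 1 / (1^T Omega^{-1} 1). The function name and the example numbers are hypothetical.

```python
import numpy as np


def blue_weights(omega):
    """Minimum-variance weights that sum to one, given a covariance matrix.

    Assumes each return is an unbiased estimate of the same quantity, so the
    GLS solution is Omega^{-1} 1 normalized by 1^T Omega^{-1} 1.
    """
    omega = np.asarray(omega, dtype=float)
    ones = np.ones(omega.shape[0])
    z = np.linalg.solve(omega, ones)  # z = Omega^{-1} 1
    return z / z.sum()                # z.sum() equals 1^T Omega^{-1} 1


# Hypothetical example: three uncorrelated returns with growing variance.
omega_example = np.diag([1.0, 2.0, 4.0])
weights = blue_weights(omega_example)

# The combined estimate is the weighted sum of the individual returns.
returns = np.array([1.0, 1.4, 0.6])
estimate = float(weights @ returns)
```

With a diagonal Omega the weights reduce to inverse-variance weighting; off-diagonal entries are what distinguish this formulation from treating the returns as independent.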
Quality
I believe the derivation of the Omega return weights (the main technical result) is technically sound. The various steps of approximation are generally well-motivated. One argument that I did not totally follow starts on line 208. The argument is intended to motivate the assumption that the covariance between two different depth returns is approximately equal to the variance of the shallower return. Intuitively I can see why this might be reasonable; it seems sensible in many problems to expect the return at one depth to be very similar to a slightly deeper return (less so for much deeper returns, but therein lies the approximation). The thing I didn't get is the argument that the i-step return should be nearly independent of the depth-j TD error. The claim seems to be that they should be nearly independent because they are *conditionally* independent given state. As I'm sure the authors are aware, there's no reason that two conditionally independent variables should also be independent. If you want to assume those two quantities are independent, it's important to develop a better-founded sense of what that assumption really means.
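A small worked example makes the gap between conditional independence and independence explicit. The variables below are hypothetical and are not taken from the paper: S plays the role of a shared state, and X and Y are two quantities that each depend on it.

```latex
% Hypothetical toy example: S is a shared "state"; eps_1, eps_2 are noise terms.
\begin{align*}
  & S \sim \mathrm{Bernoulli}(1/2), \qquad
    \varepsilon_1, \varepsilon_2 \text{ zero-mean with finite variance}, \qquad
    S, \varepsilon_1, \varepsilon_2 \text{ mutually independent}, \\
  & X = S + \varepsilon_1, \qquad Y = S + \varepsilon_2, \\
  & X \perp Y \mid S, \quad \text{yet} \quad
    \operatorname{Cov}(X, Y)
      = \operatorname{Var}(S)
        + \operatorname{Cov}(S, \varepsilon_2)
        + \operatorname{Cov}(\varepsilon_1, S)
        + \operatorname{Cov}(\varepsilon_1, \varepsilon_2)
      = \tfrac{1}{4} \neq 0 .
\end{align*}
```

Given S, the only remaining randomness in X and Y comes from the independent noise terms, so they are conditionally independent; marginally they are correlated through S.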
The claim that the approximation yields accurate weights is a little imprecise. It is mainly supported by eyeballing the approximated quantities to see that they are qualitatively similar to the targets. I'd have preferred a more quantitative study of the accuracy of the approximation, perhaps studying the relative impact of the various stages of approximation proposed in Section 4.
The empirical claim that using the approximate Omega return yields more accurate value estimates than the usual lambda return seems well-supported.
Clarity
I found the paper to be well-written for the most part. I particularly appreciated the introduction, which I thought laid out the issues at hand clearly and effectively set the stage for the technical contributions. The main area for improvement in clarity I see is Section 4. The section takes the reader through multiple different stages of approximation and it starts to feel a bit like a run-on sentence ("and then...and then..."). I'd humbly suggest that the authors take another look at this section with an eye toward providing the reader with more structure to hold on to. Some things that might help include additional subheadings, some text along the way to place the current step in the overall approximation plan, and/or perhaps an executive summary at the end of the various assumptions and their consequences.
The other major impediment to clarity is the relegation of the description of the experiments to the supplementary material. This makes it harder to determine the meaning of the results; the figures even contain terms that are not defined anywhere in the main text. The authors should be aware that they are permitted to place their references section on the 9th page, leaving nearly 3/4 page of available space. While I understand that the appendix as is cannot fit in that space, I expect that with some judicious editing significantly more detail regarding the experiments can fit in the main text.
Originality
As far as I am aware, the Omega return is a novel concept, and the derivation and subsequent approximation presented here are also original, though they make use of well-known machine learning tools.
Significance
The conceptual problem with the lambda return is very real, and effectively addressing the bias in the estimator could potentially have an important impact on RL practice. However, this is not a novel insight: the issues with the lambda return have already been discussed in the literature. So the burden of significance falls to the Omega return itself.
The principle underlying the Omega return is a good one, and I wouldn't say that its formulation was a priori obvious. So one source of potential significance (that the paper itself raises) is that the Omega return, as described here, may now serve as a target for approximation for future study.
I do see potential value in having this idea available in the literature as a foundation for future work.
The other possible source of significance is the particular approximation described here. Unfortunately, to my eyes it looks like this method is fairly complicated and limited compared to the lambda return. As George Box is famous for saying, "All models are wrong, but some are useful." Both the approximate Omega return and lambda return are wrong, but from these results I'd be forced to conclude that the lambda return is currently more useful. I doubt the specific approximation method in this paper will significantly impact RL practice. At best it serves as a proof of concept that one can approximate the Omega return to some extent, giving hope that a better approximation will come down the road eventually.
Q2: Please summarize your review in 1-2 sentences
The motivation for this paper is of genuine interest, though the practical impact of the presented approach seems small in exchange for a great deal of complication. In the end, I am convinced that the clear statement of the problem of approximating/improving upon the Omega return may be of sufficient significance.
Submitted by Assigned_Reviewer_3
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper deals with policy evaluation (for a single state) and proposes an alternative to eligibility traces for weighting n-step returns to provide a value estimate. It starts by presenting complex returns (schemes for weighting n-step returns, typically eligibility traces, but also the more recent gamma-return). Then, the authors formalize the problem of finding the weights as a generalized least-squares problem, that is, as minimizing the variance. The necessary covariance matrix (Omega) being unknown, it is estimated from data. To avoid overfitting, an approximation is made (replacing the covariance between two returns by the variance of the shortest return, this variance being itself parameterized). Then, some experiments in an off-policy setting (not addressed in the paper) are provided.
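The covariance approximation described above can be illustrated with a toy sketch. This is not the paper's estimator: it assumes only the structure Cov(G_i, G_j) = Var(G_min(i,j)), with hypothetical variance values and helper names, and omits the parameterization of the variances and the estimation from data.

```python
def min_structure_covariance(variances):
    """Omega[i][j] = variances[min(i, j)]: the covariance of two returns is
    taken to be the variance of the shorter one (assumed structure only)."""
    n = len(variances)
    return [[variances[min(i, j)] for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product for lists of floats."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Hypothetical variances that grow with the length of the return.
variances = [1.0, 2.0, 3.5, 6.0]
omega = min_structure_covariance(variances)

# Minimum-variance weights that sum to one must satisfy Omega w = c * 1.
# Putting all weight on the shortest return satisfies this condition,
# because the first column of Omega is constant and equal to variances[0].
one_step_only = [1.0, 0.0, 0.0, 0.0]
omega_w = matvec(omega, one_step_only)
```

Under this structure alone, with strictly increasing positive variances, Omega is positive definite and the weight vector concentrated on the shortest return is therefore the unique minimizer. Any weight the full method places on longer returns has to come from departures from this idealized structure.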
Major comments:
* The addressed topic is important, but this work is too preliminary. The proposed approach is an approximation of an ideal algorithm based on an unrealistic assumption. Moreover, it is hardly applicable (as it requires a set of independent trajectories and a value function estimate over the whole state space to estimate a single value).
* The whole paper deals with on-policy learning; this setting should have been considered in the experimental section (with possibly additional experiments in the off-policy setting). Also, it would have been better to study the bias and variance of the estimates (even if the RMSE is related to these quantities).
* The proposed approach could have been compared to the (recent) emphatic TD learning approach of Sutton et al. (especially regarding the off-policy aspect).
Minor comments:
* Overall, the writing could be improved (notably a better introduction of the problem and notation, and more formal writing: for example, not using var for both the variance and the empirical variance).
* l.87: this equation makes no sense
* l.210 and 213: index t+j for the reward -> t+j-1
* In the 5x5 gridworld, the true value function (ground truth) could be easily computed.
Q2: Please summarize your review in 1-2 sentences
This paper proposes an alternative to eligibility traces for weighting n-step returns to provide a value estimate of a given state. The problem addressed here is important and interesting, but the writing could be improved and, above all, this work is too preliminary (acknowledging that this is an interesting research direction): the proposed approach is an approximation of an ideal algorithm based on unreasonable assumptions and is only applicable in very constrained conditions, and the experiments do not address the first motivation of the paper (namely, improving the bias-variance tradeoff of eligibility traces in the on-policy case).
Submitted by Assigned_Reviewer_4
Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
130: "...the unbiased estimator with the lowest...": that is a *linear* function of y, don't forget.
157: "Notice that ... if Assumption 2 holds. Still..." Make a better argument why this might still be reasonable even if your model is misspecified.
The issue of correlated return estimates comes up in GPTD and related methods, I believe. It may be worth looking for connections.
Constraints on the estimated \Omega probably invalidate consistency/asymptotic unbiasedness results. Comment?
Q2: Please summarize your review in 1-2 sentences
The authors formulate a weighted-return estimate of value using GLS that is similar in concept to TD(\lambda). The core idea is good; it is not clear how practical it is or what the consequences of the approximation approaches are.
Q1: Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Thank you for your feedback. We will try to address the major questions:
Reviewer 2:
> I believe the experiments are missing an important comparison
If the important comparison is true online algorithms:
Our primary goal was to compare the lambda and omega returns, not the algorithmic machinery that can be built around them. Our comparisons are *not* to TD(lambda) (true online or not). In our experiments we fix the algorithmic machinery and vary only the complex return used. That is, we compare the lambda-return to the omega-return while fixing all other aspects of the algorithm. Also, the lambda-return is computed from batch data generated by a fixed policy using the *forward* view. So, the lambda-return that we use is the desired target that the true online methods strive to produce.
If the important comparison is emphatic algorithms:
The emphatic algorithms also make use of complex returns. The appropriate comparison would be emphatic TD using the lambda-return and emphatic TD using the omega-return. One might be able to view the emphasis at time t as a modification to the weight, lambda^t, given to the update at time t, to produce a new complex return. However, we are aware of no papers that make this argument. Still, in the setting used in our experiments (evaluating only the initial state), this new complex return degenerates to a lambda-return (since the only state with non-zero emphasis is the initial state).
Furthermore, our experiments are for off-policy *policy* evaluation. That is, we only care about the value of the initial state. The GTD family of algorithms is for off-policy evaluation (for some distribution over states). When used only with an emphasis on the initial state, they degenerate to simple importance sampling (which we compare to).
> I'd be forced to conclude that the lambda return is currently more useful.
We agree: the omega return and its approximation are more complicated and more challenging to compute. For many (most?) applications (e.g., if the per-time-step complexity must be linear in the number of features) the lambda-return is clearly the superior choice. However, we do not feel that this is reason to abandon this line of research:
1) Approximations of the omega-return, even in their primitive state, are already of practical use in some cases, including off-policy policy evaluation (as shown in our experiments).
2) The lambda-return has decades of research improving it (including the recent true online work). It is our hope that future similar advancements might allow for efficient approximation of the omega return.
3) We view the lambda-return as an efficient but poor approximation of the omega-return. There is an abundance of possible approximations to consider, ranging from computationally efficient but poor approximations to inefficient but good approximations. For another efficient approximation, we could use the lambda-return but where lambda is automatically set to approximate the omega-return as closely as possible. In this paper we are proposing the omega-return, and so a detailed comparison of all possible approximation schemes is beyond the scope of this paper. We look forward to other researchers each proposing their own approximation that suits their personal desires, be they linear time complexity at the expense of performance, or maximizing performance at the expense of computational complexity.
Reviewer 6:
> Comment?
You are correct: the approximate omega-return is not a BLUE or even necessarily an unbiased estimator given the assumptions under which the omega-return is a BLUE.
ALL REVIEWERS: The lambda-return is at the foundation of RL, but is rarely questioned. This paper attempts to change this: we question the motivation behind the lambda-return and find that it is not the "best" possible target. The primary goal of this paper is not to provide a single specific TD algorithm with a new complex return, but to study what the target for TD algorithms should ideally be. In short, this paper is about making progress in our understanding of complex returns; it is not about providing a specific algorithm that can immediately replace and improve upon the entire family of TD(lambda) algorithms.
