__ Summary and Contributions__: In "How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization", a new objective for deterministic actor-critic methods is introduced in which the critic is optimized to minimize the norm of the action-gradient of the TD-error rather than the TD-error itself. The new objective is theoretically justified and a practical implementation via a fitted dynamics model is empirically evaluated. Furthermore, the approach is applied to the more general case of unknown reward functions.

__ Strengths__: The paper contributes a new perspective on off-policy actor-critic algorithms and is of clear interest to the NeurIPS community. The result that the difference between the true and approximated policy gradient is upper bounded by the action-value gradient of the policy evaluation error is very insightful. The theoretical grounding is solid.

__ Weaknesses__: To alleviate the problem of erroneous model predictions, the approach relies on a computationally expensive ensemble, since noise in the dynamics estimation is directly propagated to the value function and the policy, which can cause tremendous stability issues -- this is already known for other combinations of value- and model-based methods. The rather uncommon hyperparameter setting (swish activation, Huber loss for the critic, complicated training scheme of the model) and the limited evaluation on fairly easy control problems raise doubts whether the approach is easily applicable to more complex tasks. Furthermore, the unknown reward setting is only evaluated for the Pusher, and the asymptotic performance only for the HalfCheetah-v2 environment. All other evaluations are capped after 1e5 transitions, which is especially bad for the Swimmer-v2 environment where the resulting return does not reflect any meaningful performance (to a degree that it should not be included as is in the camera-ready version). Additionally, it would be interesting to compare the introduced approach to n-step rollouts via model-based value estimation (see reference below) within TD3 and the same model setting.
Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning. Feinberg et al., 2018.

__ Correctness__: The proposed method is shown to be very effective on a variety of continuous control benchmarks. The approach is straightforward and should not be hard to reimplement; the exact hyperparameter setting is given in the appendix. However, no code was submitted.

__ Clarity__: The paper is exceptionally well written and a pleasure to read. All derivations are easy to follow.

__ Relation to Prior Work__: Due to its relation to Dyna, a few Dyna-based and model-based actor-critic methods are missing in the related work section [1,2,3]. Apart from that, the approach is very well put into context of current research and the differences are clear.
[1] Continuous Deep Q-Learning with Model-based Acceleration. Gu et al., 2016.
[2] Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning. Kalweit and Boedecker, 2017.
[3] Model-Augmented Actor-Critic: Backpropagating through Paths. Clavera et al., 2020.

__ Reproducibility__: Yes

__ Additional Feedback__: The broader impact section is very superficial and only partially discusses the societal implications of the work; however, the submission does not raise potential ethical concerns.
-----
Post-rebuttal update:
Thank you for the clarifications. However, despite the statement that the hyperparameter setting is common, it does seem to be rather approach-specific. Thus, it is rather questionable whether any "reasonable setting" works well. In addition, the missing asymptotic evaluation (and the unknown results of such) and the points raised by fellow reviewers on TD-error vs. action-gradient are reasons for concern. This leads me to correct my scores downwards.

__ Summary and Contributions__: The paper proposes Model-based Action-Gradient-Estimator Policy Optimization (MAGE), a model-based policy optimization algorithm, which improves the performance of deterministic policy gradient by learning action-value gradient.

__ Strengths__: The problems of improving deterministic policy gradient, learning a better action-value gradient, and model-based policy optimization are all important.
The paper proposes a complete framework, and the experiments look promising, improving over the TD3 method.

__ Weaknesses__: (1) The theory is not convincing. Proposition 3.1 is used to show that "the norm of the action-gradient of the policy evaluation error instead of its value that should be minimized". However, this follows from the specific choice of the objective on the l.h.s. of Proposition 3.1. Just saying "minimizing the TD-error does not guarantee that the critic will be effective at solving the control problem" does not convince me that the proposed objective is "a better objective function for critic learning". The authors should show that minimizing the TD-error is not a good choice rather than just make plain claims.
(2) The design of Eq. (7), together with the statement "minimization problem in Equation 6 is hard", seems to me to contradict the claim made after Proposition 3.1. To me it leads to the conclusion that "TD error is important". The authors do not explain what is going on, or which one (TD error or gradient norm) really matters here. If using only the TD error or only the gradient norm does not work, then the conclusion should be that both of them are important, rather than "the value gradient is a better objective than TD error".
(3) The experiments look promising. However, first, they cover only relatively simple MuJoCo tasks (not including difficult tasks like Humanoid). Second, the authors claim that "there is no intrinsic advantage in terms of sample-efficiency for model-based reinforcement learning" by comparing TD3 and Dyna-TD3. This conclusion seems too hasty to me and is not convincing. To support the (big) conclusion that "model-based RL has no advantage in terms of sample-efficiency than model-free RL", I think many more algorithms (TD3 alone is far from enough) and other ways of learning and using models should be evaluated and compared. Thus the subsequent claims that the performance comes from the algorithmic design are not convincing to me.
=====Update=====
Thanks for the rebuttal. I agree with the other reviewers that Proposition 3.1 makes sense, since the purpose of critic learning is to provide a better policy gradient. Thus I will increase my score to 5.
However, I still find lines 135-138 and Appendix B.1 (i.e., that using the action-gradient alone fails) unclear, and this makes the main point of the paper (that the action-gradient is a better choice for critic learning than the TD-error) less trustworthy. The authors say this is because of "local optima", but it is not clear why combining the TD-error and the action-gradient would not suffer from the same issue. Since the main point is to claim the importance of the action-gradient, this unclear point also seems an important issue.
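For reference, the two objectives under discussion can be written side by side. This is only a sketch in assumed notation (\hat{p} for the learned model, \lambda for the weight on the TD-error term) and is not a transcription of the paper's Equations 6 and 7:

```latex
\hat{\delta}(s,a) = Q(s,a) - \bigl( r(s,a) + \gamma\, Q(s', \pi(s')) \bigr),
\qquad s' = \hat{p}(s,a)

\mathcal{L}_{\text{grad}} = \mathbb{E}\bigl[ \lVert \nabla_a \hat{\delta}(s,a) \rVert \bigr],
\qquad
\mathcal{L}_{\text{combined}} = \mathbb{E}\bigl[ \lVert \nabla_a \hat{\delta}(s,a) \rVert \bigr]
  + \lambda\, \mathbb{E}\bigl[ \lvert \hat{\delta}(s,a) \rvert \bigr]
```

Under this reading, \lambda = 0 corresponds to the pure gradient-norm objective, and the open question is why a positive \lambda would avoid the local optima attributed to it.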

__ Correctness__: The claims are not made in a way that convinces me. More investigation seems to be needed to make things clear.

__ Clarity__: The writing and presentation are clear.

__ Relation to Prior Work__: The related work discussion is thorough.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: *** post rebuttal***
I maintain my score for weak acceptance.
I do believe the estimator is not novel (the framework of "Credit assignment techniques in stochastic computation graphs" is the same as MAGE, i.e. providing valid and lower-variance estimators of policy gradients, and the very same estimator is presented as an example in that paper. All papers I mentioned also cite Werbos, but that's not the point I was making).
But the focused investigation of this particular estimator, along with the numerical experiments, and the theoretical result, justify the acceptance of the paper in my eyes.
Two additional points I wanted to make:
- the paper claims that learning a critic then computing its gradient is not as principled as directly learning the value gradient, since the policy gradient error will be dominated by the error in the value gradient. This is true, but the authors need to highlight and make it very clear this is not true for policy gradients in general, but for reparametrized (e.g. deterministic) policy gradient in particular. The error of a reinforce-like estimator would be driven by the TD error. While powerful in practice, it is not always true that DPG estimators have lower variance than 'regular' PG ones, so it would be best to make it very clear that the statements only apply because the authors consider a DPG-like scheme.
- While it is true there is no particular reason to trust the gradients obtained by differentiating a critic which was learned by minimizing the critic error, it is also not clear why we should trust the gradients of the model, which will typically be learned by minimizing the model error.
***
The paper studies the problem of learning a critic used for a DPG-style algorithm (differentiating through the critic to provide a gradient to the continuous action). The authors note that the traditional approach (regression of the critic against a valid target, then differentiating it) does not guarantee that the critic is useful, since it is the gradient of the critic we are interested in. They show the error in DPG is related to an error between the true Q function gradient and the critic gradient. Since a target for the gradient of the Q function is not immediately available, they suggest building one by using a one-step model and a bootstrap argument.
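To make the construction concrete, here is a minimal scalar sketch of the action-gradient of a model-based TD error. Everything in it is a hypothetical stand-in chosen so the chain rule can be written by hand: the linear model, the quadratic critic, the linear policy, the reward, and all constants are assumptions, not the paper's setup.

```python
# Toy scalar example; all functions and constants are hypothetical stand-ins.
GAMMA = 0.9          # discount factor (assumed)
POLICY_GAIN = -0.4   # hypothetical linear policy a = POLICY_GAIN * s


def model(s, a):
    """Hypothetical learned one-step model: s' = 0.8*s + 0.5*a."""
    return 0.8 * s + 0.5 * a


def reward(s, a):
    """Hypothetical known reward function."""
    return -(s * s + 0.1 * a * a)


def policy(s):
    """Deterministic policy a = pi(s)."""
    return POLICY_GAIN * s


def critic(s, a, w):
    """Quadratic critic Q(s, a) = w0*s^2 + w1*s*a + w2*a^2."""
    return w[0] * s * s + w[1] * s * a + w[2] * a * a


def td_error(s, a, w):
    """Model-based TD error: Q(s, a) - (r(s, a) + gamma * Q(s', pi(s')))."""
    s_next = model(s, a)
    target = reward(s, a) + GAMMA * critic(s_next, policy(s_next), w)
    return critic(s, a, w) - target


def action_grad_td_error(s, a, w):
    """Analytic d(td_error)/da via the chain rule through model, policy, critic."""
    s_next = model(s, a)
    a_next = policy(s_next)
    dq_da = w[1] * s + 2.0 * w[2] * a                    # dQ(s, a)/da
    dr_da = -0.2 * a                                     # dr(s, a)/da
    ds_next_da = 0.5                                     # ds'/da for the linear model
    dq_ds_next = 2.0 * w[0] * s_next + w[1] * a_next     # partial of Q w.r.t. s at (s', a')
    dq_da_next = w[1] * s_next + 2.0 * w[2] * a_next     # partial of Q w.r.t. a at (s', a')
    dtarget_da = dr_da + GAMMA * ds_next_da * (dq_ds_next + dq_da_next * POLICY_GAIN)
    return dq_da - dtarget_da


w = [-1.0, 0.2, -0.3]
grad = action_grad_td_error(0.7, -0.2, w)
print("norm of action-gradient of TD error:", abs(grad))
```

In this scalar case the quantity the critic would be trained to shrink is simply `abs(grad)`; in a real implementation the hand-written chain rule would be replaced by automatic differentiation through the learned model.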

__ Strengths__: - The paper is clearly written and gives good background on an important problem in DPG-style algorithms.
- The proposed approach is mostly sensible and is relatively simple to implement (in problems where state is well known).
- Good experimental results on standard continuous controls benchmarks.

__ Weaknesses__: - The novelty is somewhat limited: the challenge of learning the value gradient is mentioned at least in [1], and the technique to provide a valid target for the value gradient in [2] and more generally in [3] (theorem 6, specialized to the approach of interest on page 26, 'for instance a one-step gradient critic..'), the use of Sobolev norm is suggested in [4].
- The paper suggests replacing the evaluation error \delta (which is inaccessible unless using whole-horizon model rollouts) by the TD error \hat{\delta}. This is probably only a notation error, but as presented in the paper, the loss appears invalid.
In TD-learning for learning a Q function, the target can be written as r + gamma Q(x',a'); however, when using it inside an L2 loss, it is important to not differentiate into the target (this is often implemented using a stop_gradient, i.e. (Q(x,a) - stop_grad(r(x,a) + gamma Q(x',a')))^2).
The same exact issue happens when learning value gradient: the target dr/da + dx'/da dQ(x',a')/dx' is a valid target, but should not be differentiated into (note it depends on the optimization parameters through Q).
[1] Compatible Value Gradients for Reinforcement Learning of Continuous Deep Policies, Balduzzi et al.
[2] Value gradients, Fairbank
[3] Credit assignment techniques in stochastic computation graphs, Weber et al.
[4] Sobolev training, Czarnecki et al.
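The stop-gradient point can be illustrated with a linear critic, where both gradients are available in closed form. The sketch below is an assumed toy setup (linear Q(x,a) = w . phi(x,a), hand-picked features and discount), not the paper's loss: it contrasts the semi-gradient, where the target is held constant, with the gradient obtained by also differentiating into the target.

```python
GAMMA = 0.99  # discount factor (assumed)


def q_value(w, phi):
    """Linear critic Q = w . phi, with phi a hypothetical feature vector of (x, a)."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def td_gradients(w, phi, reward, phi_next):
    """Return (semi_gradient, full_gradient) of (Q(x,a) - target)^2 w.r.t. w."""
    target = reward + GAMMA * q_value(w, phi_next)
    delta = q_value(w, phi) - target
    # Semi-gradient: the target is treated as a constant (the stop_gradient case).
    semi = [2.0 * delta * p for p in phi]
    # Full gradient: the target is differentiated as well, since it depends on w.
    full = [2.0 * delta * (p - GAMMA * pn) for p, pn in zip(phi, phi_next)]
    return semi, full


semi, full = td_gradients([0.5, -1.0], [1.0, 2.0], 1.0, [0.0, 1.0])
print("semi-gradient:", semi)
print("full gradient:", full)
```

The two gradients differ exactly in the components where the next-state features are non-zero, which is the term a stop_gradient removes. The same distinction carries over to a value-gradient target, which also depends on the critic parameters.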

__ Correctness__: Yes

__ Clarity__: Yes

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: The approach for building a valid target for the value gradient can be generalized to a k-step rollout (model for k steps, then bootstrap), or even mixed with a TD(lambda)-like approach for the value gradient. It would be interesting to see if deeper rollouts help in some environments. Similarly, the (1-step or k-step) model-based target can be used directly instead of dQ/da, as suggested by [1], [2]. This is worth mentioning since there is a strong methodological connection between the two (a valid bootstrap target can always be used either to learn a critic/gradient critic or as a replacement for the critic/gradient critic in a policy gradient scheme).
[1] Imagined value gradients: Model-Based Policy Optimization, Byravan et al.
[2] Model-augmented actor-critic: backpropagating through paths, Clavera et al.
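As a rough illustration of the k-step generalization, the sketch below rolls a model forward k steps under the policy and then bootstraps with the critic. The scalar model, reward, policy, and critic are hypothetical stand-ins, and the action-gradient of the target is taken by central finite differences purely as a placeholder for automatic differentiation.

```python
GAMMA = 0.9  # discount factor (assumed)


def model(s, a):
    """Hypothetical learned one-step model."""
    return 0.8 * s + 0.5 * a


def reward(s, a):
    """Hypothetical reward function."""
    return -(s * s + 0.1 * a * a)


def policy(s):
    """Hypothetical deterministic policy."""
    return -0.4 * s


def critic(s, a):
    """Hypothetical fixed quadratic critic Q(s, a)."""
    return -1.0 * s * s + 0.2 * s * a - 0.3 * a * a


def k_step_target(s, a, k):
    """Sum k discounted model rewards from (s, a), then bootstrap with the critic."""
    total, discount = 0.0, 1.0
    for _ in range(k):
        total += discount * reward(s, a)
        s = model(s, a)
        a = policy(s)
        discount *= GAMMA
    return total + discount * critic(s, a)


def action_grad_of_target(s, a, k, h=1e-5):
    """Finite-difference d(target)/da; a stand-in for autodiff through the rollout."""
    return (k_step_target(s, a + h, k) - k_step_target(s, a - h, k)) / (2.0 * h)


for k in (1, 2, 5):
    print(k, k_step_target(0.7, -0.2, k), action_grad_of_target(0.7, -0.2, k))
```

With k = 1 this reduces to the one-step target r(s,a) + gamma Q(s', pi(s')); larger k shifts weight from the critic to the model, which is where model error would start to matter.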

__ Summary and Contributions__: In this work the authors propose a new actor-critic learning rule which they call Model-based Action-Gradient-Estimator Policy Optimization (MAGE). Specifically, they focus on improving upon deterministic policy gradient methods in continuous control problems. While previous methods involve obtaining a critic using Temporal Difference learning and then updating the actor by computing the gradient of the learned critic, this work proposes a new approach which directly attempts to learn the critic gradient by employing a learned model. The authors show that such an approach is theoretically grounded and that it leads to more data-efficient learning.

__ Strengths__: The authors back the claims of this new approach both from a theoretical and experimental standpoint. Their method is appealing, since explicitly optimizing for the quantities that matter usually leads to better performance. This work also spends enough time examining the different parts of the proposed method and tries to provide sufficient intuition for the experimental results. The paper is well structured and written.

__ Weaknesses__: The one weakness of this work is that the authors did not spend any time investigating the role of model accuracy in the performance of the learning rule. Obtaining an accurate environment model can be a computationally expensive and challenging task in many interesting environments, and investigating how a poor model affects the learning process would be a great addition to this work.

__ Correctness__: Yes

__ Clarity__: The paper is well written.

__ Relation to Prior Work__: There is adequate discussion of previous contributions.

__ Reproducibility__: Yes

__ Additional Feedback__: It would be useful to provide quantitative results regarding the quality of the model. For example:
How good are the samples produced by the model?
How does the quality of the model affect learning?
How does the method change when a deterministic model is used instead?