Review for NeurIPS paper: R-learning in actor-critic model offers a biologically relevant mechanism for sequential decision-making

NeurIPS 2020

Review 1

Summary and Contributions: The authors collected behavioral and neural data from mice performing two foraging tasks. They compared R-learning and V-learning models on the data and found that R-learning explains the behavioral data better, while the neural data are consistent with the TD errors of both models. They also showed analytically how their model relates to the optimal policy, and extensively discussed how their work relates to other work in the field.

Strengths: The paper is a good combination of neuroscience, machine learning, and theory, all of which I found solid and nicely done. I think it’s relevant to the broad audience of NeurIPS.

Weaknesses: It would be good to compare against some other simple, non-deep baseline models, and to show whether the experimental finding is only recoverable by deep R-learning. The neural data are consistent with both models, so they do not strongly support R-learning. It would be interesting to see if there’s any neural substrate that strongly supports R-learning but not the other models. The authors claim that their work links RL, MVT, and the Bayesian inference approach. It was clear how their work links RL and MVT, but I wasn’t sure how it is related to the Bayesian inference approach. The authors did discuss a Bayesian model in the discussion, but it would be good to perform a model comparison. --- updates --- I appreciate the authors' effort in addressing most of my concerns, so I'm increasing my score. The clarity of the paper is good, but if accepted, it would be good to have one more iteration on the clarity. The results of the alternative baseline models can all go into the supplementary material.

Correctness: They are correct to the best of my knowledge.

Clarity: The paper is very well-written and I found it easy to follow. I appreciate the effort. I think R-learning first appears in line 73 without any introduction. Is the R-learning used in this paper the same as in Schwartz, 1993? If so, it would be good to cite it at the first appearance, with a brief introduction. It wasn’t clear how you model the switching cost in your model. The training schedule is not very clear to me. For example, did you train the model separately for different initial values in task 1? And did you train the model separately for task 1 and task 2? Lines 34-36 in the supplementary material seem to indicate some sequential training, in which you train the model on task 1 and then on task 2; is that right? Line 301: “is” -> “in” dynamic real-world environment.
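For readers unfamiliar with the term, the following is a minimal sketch of a tabular R-learning update in the textbook average-reward form usually attributed to Schwartz, 1993. It is not the deep actor-critic model of the paper under review. The function name, the state and action labels, and the step sizes `alpha` and `beta` are illustrative assumptions, and presentations differ on details such as whether greediness is checked before or after the value update.

```python
from collections import defaultdict


def r_learning_step(R, rho, s, a, r, s_next, actions, alpha=0.1, beta=0.01):
    """One tabular R-learning update; returns (td_error, new_rho).

    R maps (state, action) to a relative action value, rho is the running
    estimate of the average reward per step. All names are hypothetical.
    """
    best_here = max(R[(s, b)] for b in actions)
    best_next = max(R[(s_next, b)] for b in actions)
    # Assumption: greediness is checked before R is changed.
    was_greedy = R[(s, a)] == best_here
    # Average-reward TD error: no discount factor, and the reward is
    # measured relative to the average reward rho.
    td_error = r - rho + best_next - R[(s, a)]
    R[(s, a)] += alpha * td_error
    if was_greedy:
        # rho is adjusted only after greedy actions, so exploratory
        # actions do not bias the average-reward estimate.
        rho += beta * (r - rho + best_next - best_here)
    return td_error, rho


# Toy usage with made-up state and action names.
R = defaultdict(float)
td, rho = r_learning_step(R, 0.0, "port_A", "stay", 1.0, "port_A", ["stay", "switch"])
```

The contrast with discounted V-learning is that the reward enters the error term as `r - rho` and there is no discount factor, which is why a brief introduction at first mention would help the reader.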

Relation to Prior Work: The authors state in the paper that their task 2 is a novel experimental design, which it is not. For example, Constantino & Daw, 2015, which is cited by the authors, also used the same initial values in experiment 1 and a random initial value in experiment 2. So the novelty of the experimental design wasn’t clear to me. It also wasn’t clear whether their models are novel or established, and how their results and methods compare to Constantino & Daw, 2015. It would be interesting if the authors could further discuss their experimental and modeling results in comparison to Constantino & Daw, 2015, who found that human subjects overharvest and who also considered an R-learning model and other TD models.

Reproducibility: Yes

Additional Feedback: It is an interesting finding that the mice under-harvest the current resource. Do you think it is related to the short travel time between ports, i.e. the switching cost is relatively low? The authors discussed that the R-learning model could offer a mechanism for learning to make these decisions (line 269-175). I agree that, based on the findings of this paper, R-learning seems to capture the animals’ paradoxical escape better, but is this behavior beneficial? It is valuable to understand animal behavior, but it might not be valuable to implement such a model in real-world applications if the model leads to less accumulated reward. There are some claims in the discussion that do not seem well supported, e.g. lines 298-300 and lines 317-320. It is nice to provide insights and speculations, but it would be good if the authors could make it clear whether each claim is already supported by a result or is a possible hypothesis for future work. Also, the authors noted that the TD errors in V- and Q-learning are different during training before they converge to an optimal policy. It would be interesting to look at neural recordings during training.

Review 2

Summary and Contributions: This is an impressive and timely study that suggests a reference point-based framework for (deep) reinforcement learning and shows how various results from rodent foraging tasks can be reproduced better than with classical RL approaches. The paper provides a highly desirable link between the behavioural economics and reinforcement learning communities and is strong on both theoretical and empirical aspects. However, more could be done to tease out the differences between V and R learning (especially with regard to VTA activity) and to integrate the results with the prospect theory/reference point literature, which is vast, even if it does not normally employ RL. More attention should also be paid to parameter estimation and the interpretation of the estimated parameters.

Strengths: The paper provides both theoretical and empirical contributions, connects RL with behavioural economics (more implicitly than explicitly) and provides some neural evidence that VTA responses are not inconsistent with R-learning model predictions. The findings are potentially of great significance to the computational cognitive neuroscience community and to NeurIPS. Although the idea of reference point-based valuation is not novel, the theoretical framework of R learning is (even if reward rate has been used in models of effort or motivation).

Weaknesses: More attention should be paid to teasing out differences between V and R learning, with intermittent initial rewards being essentially the only example. Although it is impressive that new VTA recording data are presented in the paper, I don't feel that the result is particularly helpful: it only shows that VTA activity doesn't contradict the R-learning model, but it does not really provide specific support for it. It should be possible to design different tasks/protocols under which the two formalisations would have substantially different TD errors, which could help tease out biological correlates of the two models. Furthermore, it would be nice to see more details of parameter estimation and the resulting best-fitting parameter values, which, if done properly, may yield not only a qualitative but also a better quantitative fit between Fig. 1E and Fig. 3D (as well as between Fig. 1D and Fig. 3B). As the models have multiple parameters that substantially affect performance, the two models should be compared under best-fitting parameters, and the comparison should include formal measures like AIC, not just qualitative fits. Of course model universality regardless of parameters is helpful, but quantitative fit is equally important. Finally, integration with the behavioural economics literature should be improved.
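To make the requested formal comparison concrete, here is a minimal sketch of an AIC computation, using the standard definition AIC = 2k - 2 ln L. The model names, log-likelihoods and parameter counts below are placeholders made up for illustration; they are not values from the paper or from any fit.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Placeholder inputs: (maximised log-likelihood, number of free parameters).
# These numbers are invented purely to show the mechanics.
fits = {
    "model_a": (-520.0, 3),
    "model_b": (-505.0, 4),
}

scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
```

The point of such a measure is that an extra free parameter is only rewarded if it improves the maximised likelihood enough to offset the penalty term, which a purely qualitative comparison of figures cannot show.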

Correctness: Although I didn't check details of all proofs in supplementary material, the results look intuitive and correct, and the overall methodology is sound.

Clarity: The paper is clearly written and only has a few typos (in lines 302 and 305).

Relation to Prior Work: Although the relation to prior work is discussed reasonably well, there could be better integration with the behavioural economics and neuroeconomics literature on reference point-based valuation, which is pretty vast. It's also important to point out that the overall idea of reference-based valuation is not novel, although the theoretical framework is. It would also help to mention effort and motivation models that use similar formalisms and are relevant here (e.g. animals may switch faster due to boredom or fail to switch due to increasing fatigue). I also think that the discussion is somewhat repetitive and the saved space could be used to expand on the aspects I have suggested. Although the discussion of the connection to model-based vs model-free RL is helpful, the authors could go even further and suggest that the model-based vs. model-free dichotomy is unnecessary and should be avoided, especially in light of the results and theoretical framework developed in this study, but not only those (e.g. see https://papers.nips.cc/paper/3311-hippocampal-contributions-to-control-the-third-way.pdf )

Reproducibility: Yes

Additional Feedback: I thank the authors for promising to address my concerns such as quantitative fits and integration with the behavioural economics literature. I hope motivation models can be mentioned as well, as they are also relevant. I agree that the recording results are still useful; it's just important to point out their limitations.

Review 3

Summary and Contributions: This paper shows that R-learning models animal foraging behavior better than V-learning in a novel task where rewards per patch deplete over time and the animal must decide whether to stay or switch to a different patch. A further experiment showed that the behavioral prescription of the marginal value theorem fits better when it is worked out for R learning than for V learning. In their final experiment, the authors recorded from dopaminergic neurons in VTA, but apparently found that V and R learning explained the results equally well.

Strengths: The experiments are well designed and clearly explained.

Weaknesses: I remain unconvinced that the difference between whether neurons are better fit by the TD errors of V learning or the TD errors of R learning is terribly significant for neuroscience. Surely the neurons do not actually compute either quantity. These are just convenient abstractions. Maybe there is something I'm missing about the theoretical importance of R learning that the authors could help me to see with their reply?

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: Thanks for the reply to my question in the rebuttal. I found it helpful.

Review 4

Summary and Contributions: This paper uses a combination of experimental and analytical approaches to study the behaviour of animals in foraging tasks. The main point made by the paper is that when the initial reward of available resources is random, animals tend to show sub-optimal behaviour and leave higher-reward resources sooner than the others. The authors then use a combination of model simulation and analysis to explain this behaviour.

Strengths: I enjoyed reading the paper. It is well-written and provides interesting insights into potential explanations of animals’ behaviour in foraging tasks. I also liked the analytical approach taken here to explore general properties of the model.

Weaknesses: The effects in Figure 3D are very small and different from the data reported in Figure 1E. In particular, the threshold to leave the port seems very similar across the conditions in Figure 1E. Although the neural data are not inconsistent with R-learning, they don’t support R-learning either. In fact, the high correlation between the error signals in R and V learning suggests that the average reward is almost constant, which is inconsistent with the proposal here about the role of average reward in explaining the behaviour under intermittent initial rewards. ============ after rebuttal ==================== The authors didn't comment on the match between model and data, so my concern remains and I'll keep my score.
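The reasoning about correlated error signals can be illustrated with a toy calculation. The sketch below assumes the textbook forms of the two TD errors (discounted for V learning, average-reward for R learning), the same state values for both, a discount factor of 1, and randomly generated rewards and values. None of this is taken from the paper's model or data. Under these assumptions a constant average reward only shifts the R-learning error by a fixed offset, so the two signals are perfectly correlated, whereas a drifting average reward lowers the correlation.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def td_error_v(r, v, v_next, gamma):
    """Discounted TD error: r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v


def td_error_r(r, v, v_next, rho):
    """Average-reward TD error: r - rho + V(s') - V(s)."""
    return r - rho + v_next - v


random.seed(0)
T = 200
# Synthetic rewards and state values, shared by both models (an assumption).
rewards = [random.random() for _ in range(T)]
values = [random.random() for _ in range(T + 1)]

delta_v = [td_error_v(rewards[t], values[t], values[t + 1], 1.0) for t in range(T)]
# Case 1: the average reward rho is constant over the session.
delta_r_const = [td_error_r(rewards[t], values[t], values[t + 1], 0.5) for t in range(T)]
# Case 2: rho drifts linearly from 0 towards 3 over the session.
delta_r_drift = [td_error_r(rewards[t], values[t], values[t + 1], 3.0 * t / T) for t in range(T)]

corr_const = pearson(delta_v, delta_r_const)
corr_drift = pearson(delta_v, delta_r_drift)
print(f"constant rho: {corr_const:.3f}, drifting rho: {corr_drift:.3f}")
```

In a fitted model the two value functions would generally differ, so this only shows why a near-perfect correlation is what one would expect if the average-reward term barely changes.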

Correctness: I haven't checked all the proofs in the supplementary material, but I had a high-level look and they seem sensible to me.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Minor: In Equation 9 a “sign” operator is missing on the right side. It is mentioned that the error signal is used to train the network, but it is unclear how exactly it was used (especially for the actor network).