__ Summary and Contributions__: The paper introduces a novel framework for inferring an agent's internal model (both its reward function and its dynamics).

__ Strengths__: The method is novel in the sense that it is the first to address this problem. The authors combine several existing methods and apply them to this new setting. The framework is generalizable to many tasks. It’s a nice methodology paper. They demonstrate its effectiveness by recovering the internal model of simulated agents. The code is shared, making it easy for other researchers to apply the method. I think it’s relevant to the broad NeurIPS audience.

__ Weaknesses__: There is no baseline comparison, which is understandable since no other model exists for this problem. However, the authors also note that the model can be directly applied to a simpler setting. It would be good to compare against other models in that simpler setting, which would also show that the method generalizes to other settings.
--- Updates ---
I appreciate that the authors addressed my comments on sample efficiency. The results look good! Additional work with a real dataset would make this paper much stronger and more impactful. I'm keeping my score, though I feel this is a borderline paper without results on at least one real dataset.
The model aims to provide a way to recover animals' internal models, yielding more insight into animal behavior. It would be interesting to apply the model to real dataset(s) to see what it can recover in addition to existing findings.

__ Correctness__: They are correct to the best of my knowledge.

__ Clarity__: The paper is very well-written and I found it easy to follow.

__ Relation to Prior Work__: The authors discuss previous work extensively and clearly point out how their work differs from, and is novel relative to, the existing literature. Their contribution is clear to me.

__ Reproducibility__: Yes

__ Additional Feedback__: I like this work, and I’m excited to see new discoveries about animal/human behavior made with this method.
I suppose that reliably recovering the internal model requires some minimum number of samples, e.g. a minimum number of trials. It would be interesting to show a figure of how accurate the parameter recovery is as a function of the number of data points. This could also help experimentalists design their experiments properly if they are interested in recovering the internal model. There may be other constraining factors that experimentalists need to take into account in order to use this model effectively; I would be interested to see a discussion of them.
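As an illustration of the kind of analysis requested here, the following is a minimal sketch of a parameter-recovery curve. It does not use the paper's model: the toy agent (action equals a gain `theta` times the state plus Gaussian noise), the function names, and all numerical settings are assumptions made purely for illustration.

```python
import numpy as np

def simulate_actions(theta, n_trials, noise_sd, rng):
    """Toy agent (hypothetical): action = theta * state + Gaussian noise."""
    states = rng.normal(size=n_trials)
    actions = theta * states + rng.normal(scale=noise_sd, size=n_trials)
    return states, actions

def fit_theta(states, actions):
    """Maximum-likelihood estimate of theta; for this toy model it is least squares."""
    return float(states @ actions / (states @ states))

def recovery_rmse(theta_true, n_trials, noise_sd=0.5, n_repeats=200, seed=0):
    """Root-mean-square recovery error over repeated simulated experiments."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_repeats):
        states, actions = simulate_actions(theta_true, n_trials, noise_sd, rng)
        errors.append(fit_theta(states, actions) - theta_true)
    return float(np.sqrt(np.mean(np.square(errors))))

# Recovery error as a function of the number of trials per experiment.
trial_counts = [10, 100, 1000]
rmse_by_n = {n: recovery_rmse(theta_true=1.5, n_trials=n) for n in trial_counts}
for n, rmse in rmse_by_n.items():
    print(f"n_trials={n:5d}  RMSE={rmse:.4f}")
```

Plotting `rmse_by_n` against the trial count would give the requested figure for this toy case; for the actual method, `simulate_actions` and `fit_theta` would be replaced by the authors' simulator and fitting procedure.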

__ Summary and Contributions__: The paper proposes inverse rational control (IRC), a framework and method for simultaneously inferring an agent's reward function (as in inverse RL and rational analysis) and internal model (as in inverse optimal control) under a limitation on the agent's belief computation modeled as an extended Kalman filter (as in cognitively-bounded rational analysis). The paper demonstrates this method on simulated data generated from an artificial agent operating in a simple neuroscience task.

__ Strengths__: The overall ideas of a bounded agent (approximately) optimally maximizing an unknown reward under unknown constraints as a model for a behaving human / animal have a longstanding history in psychology/neuroscience/AI going back to Simon (as the paper reminds us). But combining all of the various possible limitations (model mismatch relative to the true world model, unknown reward, unknown dynamics) in a single framework is novel to my knowledge, in large part because this usually lands the researcher in a hard and under-constrained bilevel optimization problem with the lower level optimizing policy parameters w.r.t. reward and the upper optimizing fit parameters to data. The paper's proposal to address this issue by feeding the upper-level parameters as contextual variables into the lower-level policy optimization's state and thus solve the lower level problem only once (vs for every setting of the upper-level problem) is clever and elegant.
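The idea of solving the lower-level problem only once can be sketched on a toy problem. The example below is not the paper's belief-based actor-critic: it is a fully observed, one-step task with a linear policy, and the cost function, `LAMBDA`, the function names, and all settings are assumptions made for illustration. The policy takes the parameter `theta` as an extra input, is trained a single time over a range of `theta` values, and is then reused unchanged when fitting `theta` to observed actions.

```python
import numpy as np

LAMBDA = 0.1  # hypothetical weight on action cost

def features(state, theta):
    """Policy input: the state, the context parameter theta, and a bias term."""
    state = np.asarray(state, dtype=float)
    theta = np.broadcast_to(np.asarray(theta, dtype=float), state.shape)
    return np.stack([state, theta, np.ones_like(state)], axis=-1)

def policy(weights, state, theta):
    """Linear policy conditioned on theta."""
    return features(state, theta) @ weights

def train_policy_once(n_samples=5000, lr=0.2, n_steps=2000, seed=0):
    """Lower level, solved once: minimize the expected cost over sampled (state, theta)."""
    rng = np.random.default_rng(seed)
    state = rng.uniform(-1, 1, n_samples)
    theta = rng.uniform(-1, 1, n_samples)
    X = features(state, theta)
    w = np.zeros(3)
    for _ in range(n_steps):
        action = X @ w
        # cost = (state + action - theta)**2 + LAMBDA * action**2
        dcost_daction = 2 * (state + action - theta) + 2 * LAMBDA * action
        w -= lr * X.T @ dcost_daction / n_samples
    return w

def infer_theta(weights, states, actions, noise_sd, grid):
    """Upper level: Gaussian log-likelihood of observed actions on a theta grid."""
    log_lik = [
        -np.sum((actions - policy(weights, states, th)) ** 2) / (2 * noise_sd**2)
        for th in grid
    ]
    return float(grid[int(np.argmax(log_lik))])

weights = train_policy_once()

# Simulated "agent" data generated with a theta the fitting step does not know.
rng = np.random.default_rng(1)
theta_true, noise_sd = 0.4, 0.1
states = rng.uniform(-1, 1, 200)
actions = policy(weights, states, theta_true) + rng.normal(scale=noise_sd, size=200)

theta_hat = infer_theta(weights, states, actions, noise_sd, np.linspace(-1, 1, 201))
print(f"true theta = {theta_true}, recovered theta = {theta_hat:.2f}")
```

The point of the design is that `infer_theta` never retrains the policy: each candidate `theta` is evaluated with the same trained weights, whereas a naive bilevel scheme would re-solve the policy optimization for every candidate.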

__ Weaknesses__: The specific empirical evaluation chosen is the primary weakness of the paper. From a neuroscience perspective, the validation of parameter recovery on synthetic data is a necessary first step, but not a sufficient one. Given that [a] the task is primarily of neuroscientific interest and [b] a simpler (though also Bayesian belief-updating) fit model is given in the cited prior work, the lack of comparison of cross-validated performance against that prior model is surprising. We should either see better cross-validation performance than the models in prior work, or similar performance but more insight into / explanation of the underlying mental computation. This would show us a real payoff of the new insights here. Practically, I'm also concerned with whether this method is sample-efficient enough to address real neuroscience data, which tends to be small by modern ML standards.
Outside of the neuroscience perspective, the fireflies task may be more interesting than canonical toy grid worlds, but is still far from realistic relative to application domains in robotics, self-driving cars, etc.
== Post-Rebuttal Update ==
I have read through the other reviews and rebuttal. I appreciate the additional information about sample efficiency, and take the rebuttal's point that recovering the prior model would be more of a sanity check than a serious comparison because the prior model should be trivially recoverable in the current setup. At the same time, following discussion with the other reviewers I still think that the lack of empirical validation weakens the paper. I appreciate the difficulty of obtaining real neuroscientific data from humans or animals, but it doesn't change the fact that the neuroscience community is unlikely to engage with the work without empirical validation, and the RL contribution here is not significant by itself. Consequently, I am moving my rating up but I still think this is a borderline paper.

__ Correctness__: Yes, as far as I can tell.

__ Clarity__: Yes, and it was enjoyable to read.

__ Relation to Prior Work__: For the most part, yes. Some closely related work on the psychology side more recent than Simon 1972 is missing (including work by Anderson, Griffiths, Lewis, etc.), but otherwise the connections across both the ML and neuroscience sides are discussed.

__ Reproducibility__: No

__ Additional Feedback__: For reproducibility: details are lacking on the specifics of the synthetic data experiment in terms of the synthetic agent parameters, as well as various details of the model parameters and hyperparameters -- there is not enough here to replicate from the paper or supplement. A GitHub link to the code is provided in the supplement, which is good, but I did not click on it since I was concerned that it would break anonymity.

__ Summary and Contributions__: This paper considers the inverse rational control problem, which I summarize in my own words as follows: an RL agent interacts with a partially observable MDP using the policy it believes to be optimal, which is induced by its own biased model prior, reward prior, hypothesis class, etc. All of these internal (and therefore not directly observable) parameters are summarized notationally by theta. An experimentalist can observe only the actions taken by the agent and the complete state, and wants to infer theta.
The main contribution of the paper is three-fold: the authors first propose a way to learn an approximately optimal policy in a POMDP using a belief-based actor-critic variant; they then use an MLE-like approach to infer theta; finally, they demonstrate the effectiveness of their methodology on a real-world scenario.
___________________________________________________
Post-rebuttal update: I kept my score unchanged because I still think this paper is an interesting combination of ideas. However, for this paper to move out of the borderline range for most reviewers, more serious experiments are definitely needed.

__ Strengths__: I find the paper a blend of several interesting ideas aimed at a very interesting problem. While POMDPs have certainly been studied for a long time, both theoretically and empirically, and the method proposed by the authors may be far from the state of the art, it is based on reasonable observations and interesting ideas, and is very suitable for the scenario they consider -- the inverse rational control problem.
The problem itself may not have received a lot of attention in the RL community (and it is the first time I have ever heard of it), but to me it looks interesting enough and is a problem worth studying. There are in fact many non-trivial aspects of this problem: for example, how to characterize the belief, how to combine belief estimation and belief-based value estimation into the same pipeline, etc., and the authors have addressed most of them.

__ Weaknesses__: While I am buying the story, this paper certainly could use some improvement in the experiment section, especially if the authors want to get more attention from the RL community. Adding at least one standardized environment would most likely make the story more complete.

__ Correctness__: Yes.

__ Clarity__: Yes.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: