NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4837
Title: Better Transfer Learning with Inferred Successor Maps

Reviewer 1

This is a very interesting and timely study. It applies an intuitive (though non-trivial) idea: using multiple successor representation maps in reinforcement learning and adjudicating between them based on evidence coming from the environment. This is very relevant for understanding human and animal behaviour in complex environments with changing task conditions and reward contingencies. As this line of work (and digital phenotyping more generally) gains popularity, modelling and understanding these processes becomes increasingly important. Although the idea is fairly straightforward, I believe it has not been done before, hence the study is original. The literature is reviewed properly, and appropriate and interesting analyses are performed, hence quality and significance are high as well. The greatest weakness of the paper in its current form is clarity, which hopefully can be improved: although the successor representation is an increasingly popular area in RL, it is also fairly complicated and needs to be well explained (as is done, e.g., in Gershman, J. Neurosci 2018).

I also have a few more technical comments:
- It is not exactly clear where reward enters in section 2. The tabular case and rewards being linear in the state representation are mentioned; however, how exactly this is done should be explained more explicitly (or the text should at least point to where it is explained in the supplementary information; currently the SI has information about parameters, the algorithm and task settings, but no methodological explanations).
- It is mentioned that the mixture model is learned by gradient descent. It would be nice to see further discussion of how exactly this is done and why it is biologically realistic (as gradient descent is not something typically performed in the brain).
- It would be nice to see not only summary statistics, but also typical trajectories performed by the model (and other candidate models) at different stages of learning.
- It is mentioned that epsilon = 0 works best for BSR, but section 4.2 states that epsilon = 0.2 was used for all models in the puddle world. Why is that? Normally, when comparing different models/algorithms, effort should be taken to find the best performing parameters (or, more generally, the most suitable formalisations) for each model.
- What exactly is the correlation coefficient in section 5.1 (0.90 or 0.95) between?
- In Fig. 4, is it possible that GPI with added noise could reproduce the data similarly well, or are there other measures showing that GPI cannot fit the behavioural data as well (e.g. behavioural trajectories? time to goal?)
- Finally, this approach seems suitable for modelling pattern separation tasks, for which behavioural data is also available. It would be nice to have some discussion of this.
- There are a number of typos throughout the paper which, although they do not obscure the meaning, should be corrected in the final version.
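For readers unfamiliar with the setting behind the first comment, the following is a minimal sketch of how reward usually enters a tabular successor representation: a successor map M is learned by TD updates, a reward weight vector w is learned per arrival state, and values are read out as V = M w. This is the standard textbook formulation and is not necessarily the paper's method. The names `sr_td_update`, `state_values`, `run_chain`, the second learning rate `alpha_w`, and the toy chain environment are all assumptions made for illustration.

```python
def sr_td_update(M, w, s, s_next, r, alpha=0.1, alpha_w=0.1, gamma=0.9):
    """One TD update of a tabular successor map M and reward weights w (in place).

    Assumption: M[s][j] estimates the discounted future occupancy of state j
    when starting from s, counting the current visit to s itself.
    """
    for j in range(len(M)):
        indicator = 1.0 if j == s else 0.0
        target = indicator + gamma * M[s_next][j]
        M[s][j] += alpha * (target - M[s][j])
    # Assumption: reward is attached to the arrival state s_next.
    w[s_next] += alpha_w * (r - w[s_next])


def state_values(M, w):
    """Linear readout: V(s) = sum_j M[s][j] * w[j]."""
    return [sum(m_sj * w_j for m_sj, w_j in zip(row, w)) for row in M]


def run_chain(n_states=4, episodes=500):
    """Hypothetical toy task: a deterministic left-to-right chain with
    reward 1 on arriving at the last state."""
    # Start from the identity map: each state predicts only itself.
    M = [[1.0 if i == j else 0.0 for j in range(n_states)]
         for i in range(n_states)]
    w = [0.0] * n_states
    for _ in range(episodes):
        for s in range(n_states - 1):
            s_next = s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            sr_td_update(M, w, s, s_next, r)
    return M, w, state_values(M, w)
```

In this sketch a change in reward only requires relearning w, while M can be reused as long as the policy and dynamics are unchanged. That is the property which makes adjudicating between several maps attractive when the policy does change.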

Reviewer 2

The paper presents a novel idea, is overall clearly written, and makes an interesting contribution. However, the paper falls short in the following aspects, which need to be addressed before publishing. Figure 2 shows experiments where the reward function changes every 20 trials. According to my reading, the “Single SR” baseline experiment is almost identical to the transfer experiments presented by Lehnert et al. 2017. Why does the performance of at least the “Single SR” baseline not degrade right after a signaled reward change? Rather than reporting cumulative steps, the results would be much clearer if they reported per-episode steps or per-episode returns and compared the actual convergence rates of all tested algorithms. When making these performance comparisons, which learning rates were tested? Was a grid search performed? A grid search over a range of learning rates, exploration settings, etc. is important for ensuring the reliability of the presented results. This should be done at least for the tabular experiments. Further, it would help to benchmark simulations that do not use the CR-map rewards and to analyze how exploration degrades or changes. I think the idea of convolving the reward map is interesting, but according to my reading the paper does not empirically support why this is necessary. Other related papers studying the dependency of the SR on a policy at transfer are and These two papers should also be discussed, as the submission attempts to improve over these previous findings.
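The two reporting suggestions above can be sketched as follows. This is an illustrative sketch only: `per_episode_steps`, `grid_search` and the `evaluate` callback are hypothetical names, and nothing here reflects the code or the settings actually used in the paper.

```python
from itertools import product


def per_episode_steps(cumulative_steps):
    """Recover per-episode step counts from a cumulative-steps curve."""
    previous = 0
    steps = []
    for total in cumulative_steps:
        steps.append(total - previous)
        previous = total
    return steps


def grid_search(evaluate, learning_rates, epsilons):
    """Try every (learning rate, epsilon) pair and keep the best one.

    Assumption: evaluate(lr, eps) returns a score where lower is better,
    for example the mean number of steps per episode.
    Returns a tuple (best_score, best_lr, best_eps).
    """
    best = None
    for lr, eps in product(learning_rates, epsilons):
        score = evaluate(lr, eps)
        if best is None or score < best[0]:
            best = (score, lr, eps)
    return best
```

Running the search separately for each algorithm, rather than sharing one setting across all of them, is what makes the resulting comparison of convergence rates fair.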

Reviewer 3

I'm very concerned with the clarity of the paper. Many notations are used without definition and the method is not clearly described. For example, w(s) in L98 and phi(s') in L101 are used without definition, and the same notation \alpha seems to be used to refer to two distinct variables (L101 and L148). Figure 1 includes notations such as CR_3 and H, but their definitions are missing from the main document. While the paper provides a generative model for the successor map very briefly, the detailed method for the inference and the reasoning behind it are missing. Because of this insufficient description, it is very challenging to correctly understand the algorithm and to assess the novelty or significance of the proposed method.

Is \alpha in L101 identical to the \alpha in L148? Conditioning reward only on the arrival state s' seems unrealistic. The main algorithm is in the Appendix.

* After author response *

I appreciate the authors addressing the clarity issues in the author response. The additional description in the author response was helpful in understanding the notations and algorithms. I hope the authors revise the final version to include the details that are missing in the current version. I now understand the method and have increased my rating. I agree with the other reviewers that the proposed method is intuitive and has not been studied before, so it is worth presenting at NeurIPS.