NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, Vancouver Convention Center
Paper ID: 3627
Title: Interval timing in deep reinforcement learning agents

Reviewer 1


After reading the Author Feedback: The authors addressed and responded to all my concerns in an extensive manner. This is an interesting, well-thought-out contribution, and I am happy to increase my score.

Summary: In this paper, the authors investigate how deep reinforcement learning agents with distinct architectures (mainly feed-forward vs. recurrent) learn to solve an interval timing task analogous to a time reproduction task widely used in the human timing literature, implemented in a virtual psychophysics lab (PsychLab/DeepMind Lab). Briefly, in each trial the agent has to measure the time interval between a "ready" and a "set" cue, and wait for the same duration before responding by moving its virtual gaze inside a "go" target; the goal is that the duration between the "set" cue and the "go" response matches the interval between "ready" and "set". Time intervals during training are drawn from a discrete uniform distribution. Perhaps not surprisingly, recurrent LSTM networks can learn and perform the task well, and their performance generalizes to unseen time intervals, both via interpolation and extrapolation. Interestingly, even feed-forward networks, which lack an explicit memory, are able to perform the task, although with degraded performance. The authors analyze the networks to understand their learnt solutions, finding the representation of a timer in the activation of LSTM units. The feed-forward network, on the other hand, encodes time in the gaze trajectory itself, a form of "auto-stigmergy" (that is, encoding information in the environment, in this case the agent's gaze location).

Originality: Medium. The investigation of internal representations of (deep) neural networks is becoming more and more common in computational neuroscience, and as far as I can tell there is nothing specifically new in the methodology here; also, previous studies have used neural networks (more or less biologically inspired) to understand the development of representations of time in the absence of an explicit clock (notably Karmarkar & Buonomano, 2007, a seminal reference which is missing here).

Quality: High. The experiments and analyses are well conceived and sound, and the authors explore the robustness of their findings with several architectural variants.

Clarity: The paper is well structured; the text is well written and very clear, and the figures provide useful information.

Significance: This is an interesting contribution, potentially both for the human timing field and for machine intelligence (reinforcement learning agents, for most tasks of interest in the world, have to use some representation of timing, so it may be useful to understand how that develops).

Major comments: This is a solid, interesting paper and a pleasure to read. I congratulate the authors for making their experimental setup available. For the purpose of reproducibility, I encourage them to also release the code used to run the analyses (whatever they can; I understand that they have limitations, as noted in the reproducibility checklist), and possibly the trained networks (at least the two main ones used in the paper, not necessarily all the variants).

My first comment is that the current setup does not require the subject to hold fixation (as most, if not all, psychophysical tasks involving eye movements would require). Clearly, the neural network agents (at least the feed-forward one) are exploiting this freedom to perform the task.
As a simple control experiment, the agents should be required to keep fixation within a small box surrounding the fixation cross until the "set" cue goes off. I imagine the prediction is that the feed-forward agents would become (almost completely) unable to perform the task, while the recurrent agent should be unaffected.

More importantly, one of the results of the paper is that for the feed-forward agent the standard deviation of the production interval scales linearly with the sample interval ("scalar variability"), which is taken as a signature of "biological" behavior, known in many fields as Weber's law (the main alternative would be a square-root scaling, which is a signature of counting); see lines 131-136 and Figure 4. First, the claim of linear scaling itself is somewhat dubious: as far as I understand, the main evidence is presented in Figure 4a, but no statistical analysis is provided. At the very least, I suggest that the authors fit a generalized power law (e.g., a + b x^c) to the data, obtain a posterior over the parameters, and check that the posterior over c concentrates around 1 (I suspect the data could also be fit well by a lower exponent); see the fitting sketch appended at the end of this review. The authors also fit a model with scalar variability to the network-produced data, but a qualitatively good fit (no quantitative metric or comparison is provided) is hardly strong proof, given the well-known lack of identifiability of these kinds of models (Acerbi et al., 2014). Second, the reward function used by the authors explicitly includes a scalar term (the correctness window is proportional to the sample interval, with a beta coefficient; see line 75), so it is unclear whether the scalar law potentially seen in the data (if it is there in the first place) emerges naturally or is simply a byproduct of the chosen reward structure. The authors could test this by changing the reward function to use a fixed, interval-independent window (see the schematic appended below).

Minor comments: Line 78: The authors set beta to "8 frames", but beta is a coefficient that multiplies the time interval t_s, so I am a bit confused about the dimensionality; beta should be a dimensionless scalar.

Discussion: A couple of missing citations may be relevant as related work. First, Karmarkar and Buonomano (2007) explore how neural networks can give rise to a sense of timing without an explicit clock. Also, a recent paper investigated how different task representations emerge in recurrent neural networks during short-term memory tasks, which might be worth mentioning (Orhan and Ma, 2019).

Typo: Line 94: "followed max" --> "followed by max".

References:
Acerbi, L., Ma, W. J., & Vijayakumar, S. (2014). A framework for testing identifiability of Bayesian models of perception. In Advances in Neural Information Processing Systems (pp. 1026-1034).
Karmarkar, U. R., & Buonomano, D. V. (2007). Timing in the absence of clocks: Encoding time in neural network states. Neuron, 53(3), 427-438.
Orhan, A. E., & Ma, W. J. (2019). A diverse range of factors affect the nature of neural representations underlying short-term memory. Nature Neuroscience, 22(2), 275.
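To make the fitting suggestion concrete, here is a minimal sketch of the kind of analysis I have in mind, in Python. For brevity I use a bootstrap over intervals rather than a full posterior; the data values and variable names are placeholders, since I do not have access to the underlying data.

```python
# Illustrative only: fit sigma(t_s) = a + b * t_s**c to the per-interval
# standard deviations and bootstrap the uncertainty on the exponent c.
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, a, b, c):
    return a + b * t**c

# Hypothetical data: sample intervals and std of produced intervals at each.
t_s = np.array([20., 30., 40., 50., 60.])
sigma = np.array([1.8, 2.4, 3.1, 3.7, 4.5])

popt, _ = curve_fit(power_law, t_s, sigma, p0=[0.0, 0.1, 1.0], maxfev=10000)

# Bootstrap over intervals to get a distribution over the exponent c.
rng = np.random.default_rng(0)
cs = []
for _ in range(1000):
    idx = rng.integers(0, len(t_s), len(t_s))
    try:
        p, _ = curve_fit(power_law, t_s[idx], sigma[idx], p0=popt, maxfev=10000)
        cs.append(p[2])
    except RuntimeError:
        continue  # skip resamples where the fit fails to converge
lo, hi = np.percentile(cs, [2.5, 97.5])
print(f"c = {popt[2]:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Scalar variability predicts c near 1; counting-like noise predicts c near 0.5.
```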
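And a schematic of the reward-window control I am proposing. This is not the authors' actual implementation, which I have not seen; the beta and window values are placeholders (and note that the paper's "8 frames" for beta is itself dimensionally ambiguous, as noted above).

```python
def rewarded(t_p, t_s, beta=0.1, fixed_half_width=4.0, scalar=True):
    """Is the produced interval t_p close enough to the sample interval t_s?

    scalar=True: the correctness window grows with t_s, which builds
    Weber-like scaling into the reward. scalar=False: the window width is
    interval-independent, removing that confound.
    """
    half_width = beta * t_s if scalar else fixed_half_width
    return abs(t_p - t_s) <= half_width
```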

Reviewer 2


Quality: I found the basic idea and result interesting, but the understanding of this behavior somewhat lacking. For example, while the timing of the agent is well fit by a Bayesian model, it is not clear why the agent would arrive at this strategy, or whether it arrives at it for the same reasons as biological systems.

Clarity: I found this paper fairly clear and easy to understand. The basic results are clearly presented, and the authors are making their code available online.

Originality and Significance: I would classify this work as original, although I am undecided as to its significance. While it is fairly intuitive that less powerful architectures would display biases away from the optimal strategy, it is interesting to see how these biases agree with those found in biological systems. I do, however, have questions about the mechanisms by which these biases emerge. For example, does there not exist a set of network weights that implements the simple strategy, or does the strategy emerge from, and become frozen by, the learning dynamics? If so, how is this strategy learned?

Miscellaneous: The PDF file seems to be broken. My computer crashed several times when viewing page 3 or page 4 of the PDF (perhaps Figure 3), and I was not able to print these pages either. I don't know why this is.

Reviewer 3


This is an impressive amount of work, and an interesting example of how artificial neural networks are now being studied with ideas analogous to those used for biological networks. As with such networks, though, it is important to choose the right experiments to make the results generalisable (see below). The paper is well written and an enjoyable read.

I would have enjoyed a little more discussion of how general the results are. Are they specific to this network architecture? What has been learned about recurrent networks? Does this tell us anything about the biological brain? The training of machines is surely very different from animal learning, limiting how much can be learned about animal intelligence. This is not meant as a criticism, just an observation for discussion.

The "stigmergy" solution is very specific to the details of the task; i.e., changing the location of the fixation target would completely invalidate the strategy. Is the LSTM trained to generalise across locations? Please discuss.

Minor: References 19 and 20 look identical. Some references are incomplete.