The paper investigates how, in a diffusion process, the term accounting for the influence of the whole past can be modelled as a temporal convolution and approximated via a recurrent neural network. The AC thinks that this topic is fully relevant to NeurIPS and would be of utmost interest to an (admittedly small) fraction of the audience. However, the authors must make every effort to make their work accessible (even to statistical physicists; the Mori-Zwanzig formalism is perhaps not as well known as the authors think, to say the least) and to show thoroughly how scalable the approach is compared to alternatives. The AC dearly hopes that the authors will invest in the pedagogical and writing efforts required to make their work known to the ML community.

===========

Additional emergency review

In this paper, the authors predict an epidemic process on an unknown network by combining the generalized Langevin equation (GLE), based on the Mori-Zwanzig formalism, with deep learning. The typical context is the propagation of information (e.g. a tweet) on a social network; the method assumes fully observed cascade data, but not necessarily knowledge of the network structure. This problem is difficult, especially if the network is not known.

The originality of the approach lies in combining the GLE formalism, which involves a memory kernel, with recurrent networks. The idea is to learn a stochastic equation involving a vector of single-node infection probabilities, in which the effect of all the complex node interactions induced by the unknown network is encoded into the memory kernel thanks to the Mori-Zwanzig formalism. A crude approximation makes it possible to learn the memory kernel from the data in the form of a recurrent network, making the model easier to learn than a brute-force LSTM based on cascade data. The latent state of the recurrent network is here simply obtained by applying a delay kernel to the previous configurations.
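To make the link between a delay kernel and a recurrent latent state concrete, here is a minimal sketch. It assumes an exponential kernel K(t-s) = decay^(t-s), which is an illustrative choice and not necessarily the kernel used in the paper; the function names and the toy data are likewise hypothetical. Under that assumption, the convolution over the whole past collapses to a one-step recurrent update.

```python
import random


def latent_state_direct(history, decay):
    """Explicit convolution: h[t][i] = sum over s <= t of decay**(t - s) * history[s][i]."""
    n_nodes = len(history[0])
    states = []
    for t in range(len(history)):
        h_t = [
            sum(decay ** (t - s) * history[s][i] for s in range(t + 1))
            for i in range(n_nodes)
        ]
        states.append(h_t)
    return states


def latent_state_recurrent(history, decay):
    """Same quantity via the recurrence h[t] = decay * h[t-1] + x[t], with h[-1] = 0."""
    prev = [0.0] * len(history[0])
    states = []
    for x_t in history:
        prev = [decay * h + x for h, x in zip(prev, x_t)]
        states.append(prev)
    return states


# Toy data (an assumption for illustration): 50 time steps, 4 nodes,
# each entry standing in for a single-node infection probability.
random.seed(0)
history = [[random.random() for _ in range(4)] for _ in range(50)]

direct = latent_state_direct(history, 0.8)
recurrent = latent_state_recurrent(history, 0.8)

# Largest discrepancy between the two computations (floating-point noise only).
max_gap = max(
    abs(a - b)
    for row_d, row_r in zip(direct, recurrent)
    for a, b in zip(row_d, row_r)
)
```

The direct version costs O(T^2) in the number of time steps, while the recurrent version costs O(T); this equivalence holds exactly only for the exponential kernel assumed here, and a more general kernel would need a richer recurrent state.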
Also, the parameter learning of this model (i.e. of the memory kernel and the underlying network parameters) is presented from an original point of view, thanks to a reformulation as an optimal control problem.

I find some merit in establishing (i) a bridge between the seemingly far-apart domains of stochastic differential equations and deep learning, and (ii) a connection between parameter learning and optimal control, which suggests some connection of the procedure with reinforcement learning. I found elegant the way the complexity of the process is summarized mainly in a delay kernel K(t-s, w) in (10), which then determines the latent state of the recurrent network. I found the numerical tests convincing, as they significantly and consistently improve on state-of-the-art methods on various datasets, both in predicting infected nodes and in recovering the network structure.

Albeit difficult to read because of the diversity of the material involved, I found this paper very interesting and stimulating. The approach seems quite general and possibly transferable to other types of processes, typically partially observed Markov processes.

Remarks: I found the introduction of the variables e_I in (5) a bit misleading, since simpler, closed-form equations can be obtained purely in terms of the variables x_I(t)