NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 5434
Title: Temporal FiLM: Capturing Long-Range Sequence Dependencies with Feature-Wise Modulations

Reviewer 1

Two or three relevant citations: Transformer models should probably be mentioned in the section on "models designed specifically for use on sequences", since they compete heavily with the referenced baselines, especially on NLP tasks. I believe your numbers on the Yelp dataset compare very favorably to the "sentiment neuron" work from Radford et al. - that could be a nice addition and add further external context to your results. However, it is also worth noting that the n-gram results from (which is also a good citation) are quite strong, and worth noting in the results.

Some questions about the architecture, particularly the importance of the "additive skip connection" from input to output: how crucial is this connection, given that it somewhat allows the network to bypass the TFiLM layers entirely? Does using a stacked skip (with free trainable parameters) still work, or does it hurt network training / break it completely?

What is the SNR of the cubic interpolation used as input for the audio experiments? In other words, what is the baseline SNR of the input, with no modifications? What is the cutoff frequency of the order-8 filter for each super-resolution task? The order is relatively less important than the remaining power in the "aliasing band" after filtering, which is related to the cutoff frequency.

Specifically for evaluating audio super-resolution, it would also be nice to show numbers from other related work, for example , and from papers such as DRCNN as well as its follow-ups. You cite DRCNN, but do not compare directly to it, from what I can tell. The numbers here appear competitive with what I have seen in the literature, but it would be nice to ground them against other publications (while also paying attention to whether the datasets match the published versions, etc.). Given the close relation of the audio model to U-Net, one of the audio U-Net-type papers (such as ) would potentially be another strong baseline.
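The baseline question above (the SNR of the cubic-interpolated input before any model is applied) is cheap to measure. A toy sketch with a hand-rolled Catmull-Rom cubic interpolator; the sampling rates, downsampling ratio, and test tones here are made up for illustration and are not the paper's actual setup:

```python
import numpy as np

def catmull_rom_upsample(x_lo, r):
    """Cubic (Catmull-Rom) interpolation by an integer factor r, edges repeated."""
    p = np.pad(x_lo, 2, mode="edge")
    t = np.arange(r) / r
    segs = []
    for i in range(len(x_lo)):
        p0, p1, p2, p3 = p[i + 1], p[i + 2], p[i + 3], p[i + 4]
        segs.append(0.5 * (2 * p1 + (-p0 + p2) * t
                           + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                           + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3))
    return np.concatenate(segs)

fs_hi, r = 16000, 4                      # assumed rate and ratio, illustration only
t = np.arange(4000) / fs_hi              # 0.25 s of toy "audio"
x = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 660.0))

x_hat = catmull_rom_upsample(x[::r], r)  # cubic-interpolated model input
snr_db = 10 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))
print(f"baseline SNR of the interpolated input: {snr_db:.1f} dB")
```

Reporting this number per task would show how much of the final SNR is contributed by the interpolation alone.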
As a general note, I am not convinced of the usefulness of SNR as a measure for downsampled audio. Because aliasing implies that *any* sinusoid which matches at the sample points is a possible correct result, the mapping from downsampled audio to upsampled audio is one-to-many, meaning that a result which sounds fine could still have bad SNR compared to the reference. The authors do run a MUSHRA test, which is great - maybe this should be included in the main body of the paper rather than the appendix. AB, ABX, or MUSHRA-type tests are, in my opinion, better choices for evaluating the quality of audio upsampling models. This could also be improved by testing against known recent models from the literature, rather than the simple spline and DNN baselines.

For the genomics experiments, it is again hard to contextualize the correlation numbers and what they mean for end applications. Are there other models in the literature on this task that you could list in the table, beyond the simple baselines? As it stands, I cannot really evaluate the numbers beyond "TFiLM seems better than some CNN and some LSTM". Having a published result to compare against would again make me more confident in this result - as it stands, this experiment neither helps nor hurts my score of the paper.

You mention specifically that the model "can be run in real time" - are there any examples of this? What latency would you consider "real time", and what are the primary factors, if any, that prevent it from running in real time today?

The last sentence of the conclusion stood out to me - "applications in areas including text-to-speech generation and sentiment analysis and could reduce the cost of genomics experiments" could be better worded as "application to text-to-speech generation, sentiment analysis, and cheaper genomics experiments".

Overall, this paper shows strong performance on several benchmarks and clearly explains its core methodology.
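The one-to-many point above is easy to demonstrate numerically: two sinusoids separated by the sampling rate agree exactly at every sample point, yet score very differently under SNR against the reference. A minimal sketch (arbitrary toy rates, not tied to the paper's experiments):

```python
import numpy as np

fs = 8.0                                   # low sampling rate (Hz), toy value
t_lo = np.arange(64) / fs                  # the available sample instants
t_hi = np.arange(64 * 8) / (fs * 8)        # dense grid standing in for full-rate audio

f = 1.0
ref   = np.sin(2 * np.pi * f * t_hi)           # "ground-truth" high-rate signal
alias = np.sin(2 * np.pi * (f + fs) * t_hi)    # aliased candidate at f + fs

# Both candidates agree exactly at the low-rate sample points...
assert np.allclose(np.sin(2 * np.pi * f * t_lo),
                   np.sin(2 * np.pi * (f + fs) * t_lo))

# ...yet judged by SNR against the reference, the consistent alias scores terribly.
snr_db = 10 * np.log10(np.sum(ref ** 2) / np.sum((ref - alias) ** 2))
print(f"SNR of a sample-consistent reconstruction: {snr_db:.1f} dB")
```

Any metric that penalizes one valid inverse of the downsampling operator in favor of another is measuring something perceptual tests do not.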
The appendix contains a wealth of useful reading, and overall I liked the paper. My chief criticism concerns referencing other publications for baselining and performance: simply testing against your own CNN / DNN / LSTM is not particularly convincing - I had to dig up external references to contextualize the numbers. Having that context directly in the paper would really strengthen the quantitative aspect of this work, in my opinion. I hope the authors do release code in the future (as mentioned in the appendix), as this work seems like something the community could build upon. The method described in this work seems straightforward, and it would be interesting to apply it to a broader range of sequence problems.

POST REBUTTAL: The authors clearly addressed my primary concern about baselining performance against related papers in the results tables, as well as the other reviewers' critiques, so I improved my score slightly. Overall, I like this paper and will look for implementations (hopefully using author-released code) in new areas in the future.

Reviewer 2

This paper introduces TFiLM, a temporal extension of Feature-wise Linear Modulation (FiLM) layers that is especially well suited to sequence modelling tasks. The main contribution is a novel architecture that effectively combines convolutional and recurrent layers. Specifically, the authors propose the following processing pipeline: 1) apply a convolutional layer, 2) split the resulting activations into T blocks, 3) max-pool the representations within each block, 4) apply an LSTM over the block representations, which outputs affine scaling parameters at each step, and 5) modulate the representations in each block with the FiLM parameters.

The paper evaluates the new architecture on three sequence modelling tasks: text classification, audio super-resolution, and chromatin immunoprecipitation sequence super-resolution. For the text classification tasks, the proposed architecture is compared against LSTM and ConvNet baselines. It always outperforms the pure ConvNet baseline, and improves over the LSTM when the context is long. For the super-resolution tasks, the TFiLM architecture consistently outperforms ConvNet baselines.

Strengths:
- The paper is well written and easy to follow.
- The proposed architecture is easy to implement and performs well across a wide array of sequence modelling tasks.

Main concerns:
- While the baselines are sensible, the paper doesn't report the state-of-the-art results for the benchmarks. Reporting SOTA numbers would increase my confidence that enough effort went into tuning the baselines' hyperparameters.
- The RNN enforces left-to-right processing. I wonder whether two-way processing would further increase performance, for example with a bidirectional RNN or a transformer model. Did you experiment with this?

Minor:
- Missing citations: the self-modulation technique is closely related to Squeeze-and-Excitation networks, Hu et al. Also, FiLM layers for VQA were introduced in "Modulating Early Visual Processing by Language", de Vries et al.
- l. 109: it is not until this line that you specify that you're using max-pooling. I'd suggest specifying this earlier in the manuscript.
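For concreteness, the five-step pipeline described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the convolutional layer is assumed to have already produced the activations, the LSTM weights are random demo values, and all shapes are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One step of a minimal LSTM cell; W packs input and recurrent weights."""
    z = W @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def tfilm(acts, T, rng):
    """acts: (length, channels) activations from a preceding conv layer (step 1)."""
    L, C = acts.shape
    blocks = acts.reshape(T, L // T, C)            # 2) split into T blocks
    pooled = blocks.max(axis=1)                    # 3) max-pool within each block
    H = 2 * C                                      # hidden state carries (gamma, beta)
    W = rng.standard_normal((4 * H, C + H)) * 0.1  # random demo weights
    h, c = np.zeros(H), np.zeros(H)
    out = np.empty_like(blocks)
    for t in range(T):                             # 4) LSTM over block summaries
        h, c = lstm_step(pooled[t], h, c, W)
        gamma, beta = h[:C], h[C:]
        out[t] = gamma * blocks[t] + beta          # 5) FiLM-modulate each block
    return out.reshape(L, C)

rng = np.random.default_rng(0)
out = tfilm(rng.standard_normal((32, 4)), T=8, rng=rng)
print(out.shape)  # (32, 4)
```

Note that replacing the loop with a bidirectional pass over `pooled` is exactly the two-way processing variant asked about above.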

Reviewer 3

The rebuttal addressed my concerns about the baselines for the text classification task.

------

The idea of TFiLM is novel, and the formulation of the module makes a lot of sense. I would give a higher score if the rebuttal could address the following concern about the experiments: there is a lack of proper baselines. I am not an expert in audio super-resolution or genome sequencing; that said, at a glance, the audio super-resolution section only includes self-contained results produced by the authors, so it is hard to evaluate their relative significance with respect to that literature. The baselines for text classification seem quite weak, especially the convolutional networks. Also, for text classification it has been shown that beating a bag-of-words model is quite difficult, and the paper does not include benchmarks against one.