Review for NeurIPS paper: ARMA Nets: Expanding Receptive Field for Dense Prediction

NeurIPS 2020

Review 1

Summary and Contributions: The paper tries to resolve a well-known problem in current CNN architectures: the strong coupling between depth and receptive field. To disentangle the receptive field from the depth, the authors propose a new layer based on a CNN with an auto-regressive output. They show that this architecture indeed has a larger maximal receptive field, and that its effective receptive field is parametrizable and data dependent. Finally, they show a closed-form solution to the gradient estimation problem, give an efficient FFT parametrization that adds only a small overhead relative to a simple CNN, analyse the stability of the layer, and suggest a simple way to ensure it.
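To make the summarized mechanism concrete, here is a minimal 1D sketch in pure Python. It assumes the layer output y is defined implicitly by a (*) y = w (*) x with circular boundaries, where w is the ordinary (moving-average) convolution kernel and a is the autoregressive kernel with a[0] = 1. The sign convention, the names `arma_1d`, `dft` and `circular_conv`, and the naive O(n^2) transform standing in for an FFT are illustrative assumptions, not the authors' implementation.

```python
import cmath


def dft(seq, inverse=False):
    """Naive O(n^2) discrete Fourier transform (a stand-in for an FFT)."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            seq[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def circular_conv(kernel, signal):
    """Circular convolution: out[t] = sum_k kernel[k] * signal[(t - k) mod n]."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(t - k) % n] for k in range(len(kernel)))
        for t in range(n)
    ]


def arma_1d(x, w, a):
    """Solve a (*) y = w (*) x for y by pointwise division in the frequency domain."""
    n = len(x)

    def pad(kernel):
        # Zero-pad a short kernel to the signal length.
        return list(kernel) + [0.0] * (n - len(kernel))

    X, W, A = dft(x), dft(pad(w)), dft(pad(a))
    Y = [W[k] * X[k] / A[k] for k in range(n)]
    return [value.real for value in dft(Y, inverse=True)]


# A unit impulse as input: the output is the layer's impulse response.
x = [0.0] * 16
x[0] = 1.0
w = [1.0, 0.5]   # moving-average kernel (hypothetical values)
a = [1.0, -0.6]  # autoregressive kernel, leading coefficient 1 (hypothetical values)
y = arma_1d(x, w, a)
```

With these toy values, the plain convolution `circular_conv(w, x)` is zero ten positions away from the impulse, while `y` is not. That is the sense in which a short autoregressive kernel can enlarge the receptive field without adding depth.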

Strengths: The paper contribution is novel and quite interesting. It is potentially interesting both to the applied branch of the field as well as stimulating to future research developments in the domain of architecture design.

Weaknesses: I find the exposition of the FFT not clear enough in the main paper. There is also little coverage of the consequences of FFT truncation, both theoretically and practically. I find the experimental analysis somewhat lacking. Since the theoretical contributions seem to remove most of the impediments to large-scale experiments, I expected a few standard experiments to "test drive" the approach. I know it is a big effort, but I think the method would be quite a bit more compelling if it demonstrated improvements on something like MSCOCO segmentation, where object size varies a lot, and showed whether there is a benefit there.

Correctness: The paper seems technically correct. I haven't fully checked the ERF and stability line of results.

Clarity: The paper is well written.

Relation to Prior Work: I find the related work quite well presented. I would spend just one or two more sentences on the Spatial recurrent neural networks paragraph, because that work is the closest to this one and thus the differences are the most important to understand. I would explain why it is not possible to use the FFT on those networks but is possible with ARMA. The efficiency implications could also be mentioned, though that is handled elsewhere.

Reproducibility: Yes

Additional Feedback:
- If I understand this correctly, the FFT cost is not included in the equations for computational overhead.
- Nit: I think you mean "proceed", not "precede", in your proofs.

Review 2

Summary and Contributions: This paper proposes an autoregressive moving-average (ARMA) layer that can adapt the effective receptive field (ERF) to different tasks and datasets. A practical method to compute the forward and backward pass of the proposed ARMA layer is included. A theoretical analysis of the learning stability is provided as well. The proposed ARMA layer can be used in common networks by replacing the original convolutional layers. Experiments on two datasets (each corresponding to one task) are conducted to evaluate the effectiveness of the ARMA layer.

Strengths: The targeted problem is interesting and important. Enlarging the ERF is a hot topic in deep learning for computer vision. Recent advances like the self-attention mechanism (non-local networks) have been very popular. The introduction of the proposed method is clear and complete. A practical way to compute and train the ARMA layer efficiently is provided, along with a method to increase the learning stability. The proposed method that uses the autoregressive kernel to increase the ERF has some novelty.

Weaknesses: Some claims are not convincing to me. For example, the authors claim in Section 1 that "ARMA networks are complementary to the aforementioned architectures including encoder-decoder structured networks, dilated convolutional networks and non-local attention networks." I agree with the "encoder-decoder structured networks" part but am less convinced that ARMA networks are complementary to dilated convolutional networks and non-local attention networks. In the experimental results in Tables 2 and 3, using the ARMA layer together with dilated convolutions or the attention mechanism does not bring any improvement. Such results do not suggest a "complementary" relationship. Another example of unconvincing claims lies in Section 7, where the authors claim that the proposed ARMA layer is related to several methods. However, no explanation of how they are related is given. In addition, citations of the mentioned methods are not provided. Besides the above problems, the major weakness of this work is in its experiments, as elaborated below.
- Semantic Segmentation on Medical Images: The authors chose the attention u-net [24] as a baseline that uses the attention mechanism. However, [24] actually uses the gate mechanism instead of the attention mechanism in [30,31]. An obvious difference is that the gate mechanism 'filters' each input location by multiplying it with a scalar number, while the attention mechanism computes a weighted sum of input locations. In addition, the purpose of using the gate mechanism in [24] is unrelated to increasing the ERF. In order to compare the ARMA layer with the attention mechanism in the u-net framework, a much more suitable baseline is the AAAI'20 paper "Non-local U-Net for Biomedical Image Segmentation", where non-local attention blocks are used to replace convolutions/deconvolutions in u-net in order to achieve a global RF. In addition, since the results in Table 2 are averages over 10 runs, the standard deviation should be reported to make the comparison fair and convincing.
- Pixel-level Video Prediction: My major concerns lie in the results in Table 4. Inserting the non-local attention block into Conv-LSTM does not serve as a fair baseline. The functionality of the non-local attention block and that of the LSTM overlap. In [31] and many other studies using the self-attention mechanism, the LSTM is not used at all. Inserting the non-local attention block into Conv-LSTM will not show the true effectiveness of the non-local attention block. The backbone model of this task should be changed to a more popular and powerful model, e.g. the backbone model used in [31].
To conclude, 'weak' or inappropriate baselines are used in the experiments, making the results less convincing.
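The distinction drawn above between gating and attention can be illustrated with a small sketch. It uses scalar features at four locations and hypothetical function names (`gate`, `attention`); the sigmoid gate and softmax weights are common choices assumed here for illustration, not details taken from [24] or [30,31].

```python
import math


def gate(features, scores):
    """Gating: each location is rescaled by its own scalar in (0, 1)."""
    return [f / (1.0 + math.exp(-s)) for f, s in zip(features, scores)]


def attention(features, logits):
    """Attention: each output is a softmax-weighted sum over all locations."""
    out = []
    for row in logits:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append(sum(e / total * f for e, f in zip(exps, features)))
    return out


# Only location 0 carries a signal.
features = [1.0, 0.0, 0.0, 0.0]
gated = gate(features, [0.0, 2.0, -1.0, 0.5])
attended = attention(features, [[0.0] * 4 for _ in range(4)])
```

In this toy setting the gated output is still zero everywhere except location 0, whatever the gate scores are, whereas every attended output receives a share of the signal. This is why gating by itself does not move information across locations.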

Correctness: The method is technically sound. However, some claims and the experimental designs have flaws, as explained in the 'Weaknesses' section.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Comments after author feedback: For the semantic segmentation experiments, the comparisons in Feedback Table 1 appropriately support the claims of the authors. However, I'd like to add a remark here. The reported results in Feedback Table 1 look interesting if we compare the numbers with the original Table 2. According to the updated results, the non-local block by itself hurts the performance but improves the performance when used together with ARMA layers. There is a possibility that the non-local U-Net baseline was not well tuned. More details of this set of experiments, such as the model architectures, must be included in the updated version. Meanwhile, I'm not sure whether such a "big" revision is allowed in NeurIPS. For the video prediction experiments, I disagree that "the attention mechanism is only added *per-timestep*". Supplementary Figure 8 shows how the authors use the non-local block. It is neither the *per-timestep* way nor the original way in [31]. Indeed, I'm NOT expecting the replacement of all the attention mechanisms. However, how the authors use the non-local block does NOT lead to a fair comparison. Even if ConvLSTM is a good baseline, they should use a 2D non-local block to replace, or insert after, the original convolutions, in order to compare the effect of expanding the receptive field. I tend to keep my score. I read the other reviewers' comments as well. I sincerely suggest R1 and R4 take a closer look at the experiments.

Review 3

Summary and Contributions: The paper proposes the autoregressive moving-average (ARMA) layer, a plug-and-play extension of convolution layers that enables the flexible expansion of CNN models' effective receptive field. For efficiently performing the forward and backward passes of ARMA, an FFT-based algorithm is proposed. Further techniques are adopted to ensure training stability. Experimental results show that the integration of ARMA into existing models leads to substantial performance gains in medical image segmentation and video prediction.

Strengths: a) The idea of inserting an additional convolution operation on the output neurons is novel. It enables arbitrarily enlarging the receptive field of convolution layers without causing the gridding artifacts that dilated convolutions typically have. b) The theoretical soundness of the tunable ERF increase, the FFT-based forward/backward algorithm and the stability constraints of ARMA layers are proved in detail. c) Extensive experiments support the effectiveness of the proposed method on different dense prediction tasks. d) The proposed method may be applicable to convolution layers in any CNN model, and thus has considerable potential impact as an easy improvement over existing models in different problems.
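As a rough 1D illustration of the gridding point in a), the sketch below compares which input offsets can reach one output after stacking dilated convolutions with the same dilation rate against the impulse response of a first-order autoregressive recursion. The function names and the one-sided recursion y[t] = x[t] + c * y[t-1] are assumptions made for illustration; the paper's layer is two-dimensional and is not reproduced here.

```python
def dilated_support(kernel_size, dilation, layers):
    """Input offsets that can influence one output after stacked dilated convolutions."""
    half = kernel_size // 2
    taps = [dilation * k for k in range(-half, half + 1)]
    support = {0}
    for _ in range(layers):
        # Each layer extends the reachable offsets by every kernel tap.
        support = {s + t for s in support for t in taps}
    return sorted(support)


def ar_impulse_response(coeff, length):
    """Response of y[t] = x[t] + coeff * y[t-1] to a unit impulse at t = 0."""
    y, prev = [], 0.0
    for t in range(length):
        prev = (1.0 if t == 0 else 0.0) + coeff * prev
        y.append(prev)
    return y


offsets = dilated_support(kernel_size=3, dilation=2, layers=2)
response = ar_impulse_response(0.5, 8)
```

With a fixed dilation of 2, only even offsets are ever reached, so odd-offset inputs are skipped (the "holes" behind gridding). The autoregressive response, by contrast, is nonzero at every position it covers and decays smoothly.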

Weaknesses: a) The paper does not suggest strategies for selecting the autoregressive coefficients a for different tasks or other circumstances (e.g. integrating ARMA into multi-scale CNN models). b) It would be more interesting if the paper discussed the effect of using different autoregressive coefficients for convolution layers at different depths. c) It would be better to also experiment with ARMA on regression and generative tasks. After reading the authors' rebuttal, my concerns about the paper have been addressed. I think it's a good paper to be published in ECCV.

Correctness: The claims, the method and the methodology seem correct.

Clarity: The paper is well written. The proposed method and the corresponding proofs are easy to understand. Lucid visualizations are provided to aid the understanding of key concepts.

Relation to Prior Work: Previous works on expanding the CNN effective receptive field are comprehensively discussed, and the main distinctions and contributions of the proposed method are pointed out.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The paper proposes a new layer for learning dynamically sized effective receptive fields. The layer uses autoregressive transformations after a convolutional layer to transport signals from further away. The paper gives experimental evidence for the effectiveness of this layer.

Strengths: Thorough analysis of the proposed ARMA layer and its computation. Proposes an efficient implementation of the ARMA layer that runs in parallel, despite ARMA's sequential definition. Strong on both theoretical and empirical analysis.

Weaknesses: Experimental results are not the most convincing. A comparison with quite similar methods such as the QRNN [Bradbury et al] is missing, although comparison with other methods is mentioned. It would be nice to see evidence of improvement from ARMA on other tasks with long-range dependencies as well, perhaps in language modelling. Potential technical instabilities are mentioned for ARMA (along with how to resolve them), but these could make it harder to adopt or use the layer in practice.
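For readers unfamiliar with why an autoregressive layer can be unstable, the following sketch shows the effect on a plain 1D recursion. The check used, that the absolute values of the autoregressive coefficients sum to less than 1, is a generic sufficient (not necessary) condition from linear filter theory and is an assumption here; it is not claimed to be the constraint the authors adopt. The names `ar_response` and `l1_stable` and all coefficient values are hypothetical.

```python
def ar_response(a, steps):
    """Impulse response of y[t] = x[t] - sum_k a[k-1] * y[t-k], for k = 1..len(a)."""
    y = []
    for t in range(steps):
        acc = 1.0 if t == 0 else 0.0
        for k, coeff in enumerate(a, start=1):
            if t - k >= 0:
                acc -= coeff * y[t - k]
        y.append(acc)
    return y


def l1_stable(a):
    """Sufficient condition for a bounded response: sum of |a[k]| below 1."""
    return sum(abs(c) for c in a) < 1.0


stable = ar_response([-0.5, 0.3], 200)    # passes the check, response decays
unstable = ar_response([-1.2, 0.1], 200)  # fails the check, response grows
```

The first coefficient set keeps the response decaying toward zero, while the second makes it grow without bound. An unconstrained layer of this kind can therefore blow up during training unless its coefficients are kept in a safe region.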

Correctness: Yes. Yes.

Clarity: Pretty well written. Some minor typos.

Relation to Prior Work: Quasi-RNNs (QRNNs) seem to be the most obvious omission. QRNNs also have a convolutional part and a gated autoregressive part to aggregate long-range context. This is similar to ARMA but, unlike ARMA, QRNNs are gated and non-linear.
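The contrast can be sketched in a few lines. The gated recursion follows the "f-pooling" form h[t] = f[t] * h[t-1] + (1 - f[t]) * z[t] from the QRNN paper, simplified here to scalar inputs and width-1 filters with hypothetical weights `w_z` and `w_f`. The linear recursion is a first-order stand-in for the autoregressive part of ARMA, not the authors' layer.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def qrnn_f_pooling(x, w_z=1.0, w_f=1.0):
    """Gated recursion: candidate z[t] and forget gate f[t] both depend on the input."""
    h, prev = [], 0.0
    for x_t in x:
        z_t = math.tanh(w_z * x_t)
        f_t = sigmoid(w_f * x_t)
        prev = f_t * prev + (1.0 - f_t) * z_t
        h.append(prev)
    return h


def linear_ar(x, coeff=0.5):
    """Linear recursion y[t] = x[t] + coeff * y[t-1] with a fixed coefficient."""
    y, prev = [], 0.0
    for x_t in x:
        prev = x_t + coeff * prev
        y.append(prev)
    return y


x = [1.0, 0.5, -0.3, 0.0, 0.0]
doubled = [2.0 * v for v in x]
```

Doubling the input exactly doubles the output of `linear_ar`, which is what makes a frequency-domain solution possible. The same does not hold for `qrnn_f_pooling`, because its gate changes with the input.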

Reproducibility: Yes

Additional Feedback: