Paper ID: | 964 |
---|---|

Title: | Learnable Tree Filter for Structure-preserving Feature Transform |

1. Originality: The idea is motivated by Yang, stereo matching using tree filtering, PAMI2015. But it is considerably novel in incorporating this idea in an end-to-end deep network. 2. Quality: + Tree structure to exploit irregular shaped long term dependencies. + Detailed ablation analysis. + More efficient than PSP and non-local operations. 3. Clarity: This paper is in general well written. But I still have some concerns to be clarified: - How the higher level semantics/features can guide to generate the MST (L40-41, L141-143)? In my understanding, is the MST generate based on the weights calculated by the CURRENT feature map rather than higher ones? - How to initialize the MST weights? Are they initialized once (by Euclidean distance on features of each pixel) then updated solely by gradients, or the Euclidean distance should be calculated every step (as the CNN paras change)? - What is multi-groups in Figure 2 and how does it take efforts? 4. Significance: The proposed method better exploits irregular shaped long term dependencies for semantic segmentation. ---------------- I appreciate the authors' rebuttal which well addressed my concerns, therefore, I remain my recommendation as an accept.

Novelty ====== This paper introduces a clear new module into the crowded space of contextual models for semantic segmentation. In addition to the value of the core idea, the authors explain how to differentiate their tree filtering operator, enabling end-to-end training. Finally, they also propose a technically interesting way to reduce the complexity of computing their tree filtering module down to linear in the number of pixels. Experiments ========== The experiments are thorough, with lots of comparisons to the state-of-the-art and a clear ablation study. However, the improvements over the state-of-the-art are very minor: +0.6% mIoU over PSP [12] on VOC12 val (table 3), +0.2% mIoU over DenseASPP [32] on Cityscapes test (table 5), and +0.1% mIoU over ExFuse [36] on VOC12 test with MS-COCO pretrain (table 6). In the light of these numbers, I find the results oversold: the abstract states "leading performance on VOC 2012 (86.3% mIoU) and Cityscapes (80.8% mIoU)", exactly referring to the last two results I listed in this review (+0.1% and +0.2%). Moreover, the conclusion states "superiority of the proposed method on VOC12 and Cityscapes". These claims are not justified by such a small delta. It would be small for any system, but especially for neural networks, due to the inherent randomness of their SGD training procedure. The one significant result I could find is +1.3% on VOC12 test set without MS-COCO pretraining, but that's not what is sold in the abstract. The ablation studies in Tables 1 and 2 show the impact of the proposed new module starting from a system with no context at all. The effect is nice (+2.1% on ResNet101 on VOC12), but the true comparison is: how much more does this new context module bring compared to previous ones? And the cleanest answer I could find is in Table 3: +0.6% over PSP [3] and +0.7% over NL [12]. That is a small effect. Quality of writing ============= The quality of writing is good, but not great. There are many missing or oddly places articles, incorrect verb conjugations, and so on. Summary ======= This paper offers good novelty and it is technically interesting. However, the results are underwhelming and the comparisons to the state-of-the-art are very oversold. Reaction to rebuttal =============== The authors have provided some reply to my critique that the improvements over the state-of-the-art are small, mainly with one new experiment and by promising to include dilated convolutions to improve performance of their backbone network. Most importantly, they promised to revise their claims in the final version. This is very important. If the delta improvements remain as small as they were in the submission version, then their claims must be toned down. In the light of the rebuttal, I keep my original score and trust the authors to keep their promise.

Originality: The proposed approach is motivated by the traditional tree filter [reference 18 in the paper]. While the idea of combining a combinatorial optimization algorithm with (deep) learning has become quite common, the particular approach poses challenges implementationwise and provides many benefits, as it allows to model long-range spatial dependencies in an efficient way, a very desirable property. Quality: The paper contains sufficient ablation studies and experiments that demonstrate the strengths of the method in terms of its differences to other recent methods. One critique is a claim about the computational complexity of the proposed algorithm made in lines 124-125: "...the computational complexity of all the processes in algorithm 1, including the construction process of minimum spanning tree and the computation of pairwise distance, is O(N)...". It is mentioned in line 143 that the Boruvka algorithm is used (reference 25 in the paper). Nevertheless, the Boruvka algorithm is O(N log N). The authors should either clarify or correct this claim. While there is a randomized algorithm that achieves linear complexity in expectation, this is not mentioned in the paper. Clarity: The paper is clearly written and well organized. There are a few cases of language misuse and typos, but these can be easily fixed. Some examples: Line 1: "plays a vital role in semantic segmentation. Most of the existing methods..." Line 116: "Also, we propose..." Line 120: "The computation of these variables can be accelerated..." Significance: The idea of exploiting structure in images within a learning framework (as a neural network) has been desired for a long time. It is known that convolutional neural networks can increase the receptive field with depth but the effective receptive field remains concentrated. In recent years, various graph neural networks have been proposed and studied extensively, but an efficient / linear time algorithm was missing. Non-local networks have also been proposed but due to their O(N^2) complexity, they are only applicable for small graphs. I believe the proposed approach is very likely to be used and build upon by other researchers.