Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
- - - - - - - - - post-rebuttal - - - - - - - - - -
After reading the rebuttal and the other reviews, I am happy to recommend acceptance of this paper. My concerns were addressed in the rebuttal, and I do not see any major concerns in the other reviews. The provided explanations and results should be integrated into the final version.

- - - - - - - - - original review - - - - - - - - - -

W1 - Evaluation
W1.1 The 2D model has much less capacity than the 3D/4D model. One could therefore evaluate how much the LSTM module contributes by repeating the RGB input frame 12 times. The model architecture would then be identical to the 4D version while the input data remains 2D, making a direct comparison possible.
W1.2 How are the results for the other (learning-based) methods in Tab 2 obtained on the dataset introduced with this submission? Are the methods retrained and then evaluated, or are the pretrained models evaluated on the new data? The latter case would make the evaluation not directly comparable, since different training data was used.

W2 - LSTM
One of the main advantages of sequence models is that they can handle variable-length input. Here, every "sequence" has a fixed length, and the order of focal planes is not really crucial for the task. The question therefore arises whether a simpler model could handle the presented data. For example, a temporal-convolution approach could also fuse feature maps across time steps and would reduce the complexity of the model. An order-agnostic operation such as max/average pooling would make the method invariant to the order of the input images.

W3 - Clarity
W3.1 Table 1 is very tedious to read, since one has to refer to the caption of Fig 3 to understand the row labels. A better solution would be to include a short-hand notation for the ablated components in the appropriate column (Fig 3 could also benefit from this).
W3.2 The notation is slightly difficult to follow: there is often double and triple indexing, and variables, indices, and functions are named with words or multiple letters (Att, Conv_dilated, rates_set). Such operators should be typeset appropriately (e.g., in upright font) to make a clear visual distinction from variables and their multiplication.
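To make the suggestion in W2 concrete: an order-agnostic fusion of per-slice feature maps can be sketched with simple pooling over the slice axis. This is an illustrative sketch only — shapes, the function name `fuse_focal_features`, and the assumption of a 12-slice focal stack with 64-channel feature maps are hypothetical and not taken from the paper under review.

```python
import numpy as np

def fuse_focal_features(features, mode="max"):
    """Fuse focal-stack features of shape (num_slices, channels, H, W)
    into a single (channels, H, W) map, independent of slice order."""
    if mode == "max":
        return features.max(axis=0)   # order-invariant
    if mode == "avg":
        return features.mean(axis=0)  # also order-invariant
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical example: 12 focal slices, 64-channel 32x32 feature maps.
stack = np.random.rand(12, 64, 32, 32)
fused = fuse_focal_features(stack, "max")
assert fused.shape == (64, 32, 32)

# Permuting the slice order leaves the fused result unchanged,
# which is exactly the invariance the ConvLSTM does not guarantee.
perm = np.random.permutation(12)
assert np.allclose(fused, fuse_focal_features(stack[perm], "max"))
```

Such a pooling baseline (or a small temporal convolution in its place) would let the authors quantify how much the ConvLSTM's sequential processing actually contributes beyond simple order-agnostic fusion.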
Originality: my feeling is that the proposed neural network has some novelty, but the authors did not clearly position the paper with respect to the related work. The proposed architecture includes many components (such as the Memory-oriented Spatial Fusion Module and the Memory-oriented Integration Module), and it is not clear whether they are original. I am not an expert in the field, but I see some similarity with . The originality of each component, and of the overall architecture, should be discussed and compared with similar ones.

Quality: yes, I think the paper is technically sound and the method sufficiently justified. The proposed method is supported by the results. I would expect the authors to highlight the limitations of their approach, which is not the case.

Clarity: the paper reads well, even if it is very dense and sometimes hard to follow (due to that density). But overall, I think all the elements given in the paper can be understood, and they cover the proposed neural network well.

Significance: the results are very good on the task of SOD for light field images; I have no doubts about that. But what about other areas? Does the method also give good results on hyperspectral images? Competing papers have shown that their approaches can be successfully used in several domains; in this paper only one domain is used, which limits the significance of the paper.
Originality: Moderate. Previous methods have demonstrated that light field images can improve saliency detection, so the main novelty in this paper might be using a deep network to solve this problem. Using the ConvLSTM to fuse features from different focal lengths and different levels of VGG is interesting, but not a significant innovation.

Quality: High. The presentation of the method is clear, and the related work seems to be complete. There are also comprehensive experiments for the ablation study and for comparison with the previous state of the art. The experimental results are convincing and thorough.

Clarity: The paper is easy to read and well organized. Figure 2 demonstrates the network structure clearly and concisely. All necessary details and explanations have been included in the submission.

Significance: Given the better performance compared with previous works, as well as the indicated release of the dataset and code, this paper may be of interest to other researchers working on similar topics.

Minor issues: On line 185, is t-1 equal to 13? The intuition of using H_t-1 in SCIM is not clear.