Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Originality: To the best of my knowledge, the proposed approach is original. The new MaxSSN loss and the new training strategy significantly outperform the state-of-the-art AVOD baseline in all experiments, as does the latent ensemble layer.

Clarity: The paper is clearly written and easy to follow.

Quality: Although the proposed method is simple, it appears effective. Technically, the method seems sound and works in practice. However, it may only work in the demonstrated setting, where two sources of information are fused. It remains to be seen whether it works in other settings, where the sources of information have other dependencies or may even be correlated. It is also unclear what happens when there are more than two sources of information.

Significance: The proposed method is significant enough to merit publication.
Originality. Pros: To the best of my knowledge, the proposed method is new.

Quality. Pros:
- Detailed proofs and analysis of the proposed method are provided.
- An ablation study is provided, giving a good understanding of the work.
- Experimental results are good.
Cons:
- The paper claims that the method preserves the original performance on clean data. However, there are evident performance drops on clean data after applying the TrainSSN algorithm. For example, the 3D AP and BEV AP decrease by ~5% and ~7% on moderate clean data in Table 1.
- To demonstrate the generalization ability of the method, it would be better if the paper validated on more tasks/models. Currently, only one task (3D detection) and one model (AVOD) are validated.

Clarity. Pros:
- Overall, the paper is well written and organized.
Cons:
- The value of the hyper-parameter (t) in the sparsity constraint of the latent ensemble layer is missing.

Significance. Pros: The robust learning problem in feature fusion models that the paper addresses is an important topic, critical in practical applications such as autonomous driving. I think the proposed method and results should be useful for the community.
Summary

This paper discusses the importance of, and a method for, training deep fusion models that are robust to single-source noise, with experiments on 3D/BEV object detection. It first proposes a novel loss, MaxSSN, which is used throughout the paper as the objective for single-source robustness. It then shows the limitation of standard robust fusion training: if we do not consider each single-source noise separately and instead add noise to all inputs at once, we obtain a worse model. Two algorithms are proposed for minimizing the MaxSSN loss; the basic idea is to alternate between training on clean data and on data with noise. The authors then provide a feature fusion method that ensembles feature maps from different input sources. It is assumed that all feature maps have the same width and height but (possibly) different numbers of channels. The basic idea of the feature fusion method is to concatenate all the feature maps and then apply a convolutional layer with a 1x1 kernel. The authors finally present experiments on the KITTI dataset for 3D/BEV object detection. The results indicate that when the data contain single-source noise, the proposed method performs better.

Strengths
- The problem of robustness is important, and single-source robustness is novel.
- The paper is clearly written, with almost all variables/notations clearly defined.
- The paper provides plenty of experiments (in the main text and supplementary material) showing good performance of the proposed method.

Weaknesses
- Algorithms 1 and 2 are proposed without much theoretical analysis. They are "suddenly" introduced and then followed by experiments, without explanation of why the algorithms are designed this way or analysis of their theoretical properties. For example, why do we have to alternate between clean data and data with noise? Algorithm 2 ignores the maximum loss, so why does it work?
- The proposed latent ensemble layer is not a very novel way to fuse features. It concatenates all the feature maps and then applies a 1x1 convolutional layer, which is a reasonable way to fuse features but not particularly novel.
- The problem of robustness is definitely important, but to my understanding, noise usually does not come from a single source but from at least a few input sources. Can the proposed method be easily generalized? If not, we need more justification of the importance of single-source robustness, i.e., why we should focus on single-source noise.

Minor issues
- Equation 4: some of the 2-norm subscripts are missing.
- Line 8 of Algorithm 1: on the right-hand side, you may use the notation introduced in line 6.

Comments after the rebuttal
************************************
I thank the authors for the rebuttal. I am satisfied with the authors' explanation of the importance of single-source robustness. However, judging from the rebuttal, the authors do not have a theoretical proof about TrainSSN and TrainSSNAlt (e.g., convergence, or an upper bound on the loss). In summary, the paper is the first to formalize single-source robustness and provides two baseline algorithms without proof, which is acceptable. I believe it is marginally above the acceptance threshold, so I raised my score to 6.
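To make the single-source-noise objective discussed in these reviews concrete, here is a minimal sketch of a MaxSSN-style loss: perturb one input source at a time, keep the others clean, and take the worst-case task loss. This is my own illustration, not the authors' implementation; `model_loss`, `perturb`, and the toy sum-fusion model below are hypothetical stand-ins.

```python
def max_ssn_loss(model_loss, sources, target, perturb):
    """Worst-case single-source-noise loss (sketch).

    For each input source i, apply `perturb` to source i only,
    keep the remaining sources clean, and evaluate the task loss.
    The MaxSSN-style loss is the maximum over these per-source losses.
    """
    per_source = []
    for i in range(len(sources)):
        noisy = [perturb(s) if j == i else s for j, s in enumerate(sources)]
        per_source.append(model_loss(noisy, target))
    return max(per_source)

# toy example: two scalar "sources", a fusion model that sums them,
# a squared-error task loss, and a fixed +0.5 perturbation
loss = max_ssn_loss(
    model_loss=lambda srcs, y: (sum(srcs) - y) ** 2,
    sources=[1.0, 2.0],
    target=3.0,
    perturb=lambda s: s + 0.5,
)
# perturbing either source yields (3.5 - 3.0)^2 = 0.25, so the max is 0.25
```

The alternating strategy the review questions (Algorithms 1/2) would then, as described in the summary, alternate optimization steps between the clean task loss and this worst-case loss.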
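The latent ensemble layer the reviews describe (channel-wise concatenation followed by a 1x1 convolution) can also be sketched in a few lines: a 1x1 convolution is just a per-pixel linear map over channels, so it reduces to a matrix multiply. This is a minimal NumPy sketch assuming channels-last H x W x C feature maps; the function name and shapes are illustrative, not the authors' code.

```python
import numpy as np

def latent_ensemble_layer(feature_maps, weights, bias):
    """Fuse per-source feature maps (each H x W x C_i) by concatenating
    along the channel axis and applying a 1x1 convolution, i.e. the
    same linear map over channels at every spatial position."""
    fused = np.concatenate(feature_maps, axis=-1)  # H x W x (C_1 + ... + C_k)
    return fused @ weights + bias                  # H x W x C_out

# toy example: two sources with 2 and 3 channels, fused to 1 output channel
lidar_feats = np.ones((4, 4, 2))    # stand-in for one branch's feature map
image_feats = np.zeros((4, 4, 3))   # stand-in for the other branch
w = np.ones((5, 1))                 # shape (C_1 + C_2) x C_out
out = latent_ensemble_layer([lidar_feats, image_feats], w, bias=np.zeros(1))
# each output pixel sums the two ones-valued channels, giving 2.0 everywhere
```

A sparsity constraint on `weights` (with the hyper-parameter t the second review asks about) would restrict how many input channels each output channel may draw on.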