NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:3328
Title:Deep Model Transferability from Attribution Maps

Reviewer 1

The authors propose applying attribution maps to quantify the transferability of deep models, and they achieve reasonably accurate results with much faster estimation. The approach is, despite its simplicity, quite novel and reasonable. Unlike Taskonomy, which relies on training pairwise or higher-order transfers using annotations, the approach proposed here requires no labels and runs very fast. Extensive results demonstrate the validity of the proposed approach. The manuscript is well written; the approach is well motivated, interesting, and practical, as it applies to scenarios where pre-trained models are available but no annotations are provided. The approach could inspire a large audience.

After rebuttal: After reading all the review comments and the author feedback, I keep my original score. I think this paper makes a good contribution to NeurIPS.

Reviewer 2

The core idea is to measure task relatedness via the similarity of attribution maps. The paper is overall well written and easy to follow. The method places barely any constraint on the model architectures, requires no labelled data, and is several orders of magnitude faster than Taskonomy, which makes it very practical. Furthermore, the fact that it allows new tasks to be inserted flexibly makes the approach more attractive. Some experimental results are very impressive. It would be better if the authors could show and analyze some failure cases, where two tasks are closely related but the corresponding attribution maps are not so similar. The authors should also discuss in more detail why the proposed method works well even when the probe data are quite different from the training data of the trained models. In the experiments, SVCCA seems to work well. What is the advantage of the proposed method over SVCCA?
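The core idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the attribution maps have already been computed for each model on a shared set of probe images, and it assumes mean cosine similarity as the comparison measure. The function names `attribution_similarity` and `similarity_matrix` are hypothetical.

```python
import numpy as np


def attribution_similarity(maps_a, maps_b, eps=1e-12):
    """Mean cosine similarity between two models' attribution maps.

    maps_a, maps_b: arrays of shape (n_probe, ...), one attribution map
    per probe image, computed on the same probe images in the same order.
    Cosine similarity is an assumption made for this sketch.
    """
    a = np.asarray(maps_a, dtype=float)
    b = np.asarray(maps_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both models need maps for the same probe images")
    # Flatten each map to a vector so every probe image yields one cosine.
    a = a.reshape(a.shape[0], -1)
    b = b.reshape(b.shape[0], -1)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float(np.mean(dots / norms))


def similarity_matrix(maps_by_task):
    """Pairwise task similarity from a dict {task_name: maps array}.

    The result is symmetric by construction, because the measure itself
    does not distinguish source from target.
    """
    names = sorted(maps_by_task)
    n = len(names)
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            s = attribution_similarity(maps_by_task[names[i]],
                                       maps_by_task[names[j]])
            sim[i, j] = sim[j, i] = s
    return names, sim
```

Under these assumptions, adding a new task only requires computing its attribution maps on the probe images and one new row and column of the matrix, which is consistent with the flexible insertion noted above.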

Reviewer 3

The submission is generally well written and organized. It is easy to follow and the motivation is clear. The proposed method is straightforward and technically sound, even though no source code is available to verify this. Originality: incremental. The proposed investigation aims to address the supervised learning issue in "taskonomy" by employing attribution maps. Quality and clarity: can be further improved. See 5. Significance: limited. See 5.

Reviewer 4

Weaknesses:

1) There are three sources of similarities/transferabilities reported in this paper: attribution maps (the proposed method), SVCCA, and Taskonomy. Taskonomy's transferabilities have practical value (they are constructed and shown to reduce the need for supervision through transfer learning), but Taskonomy's method is computationally expensive. So the gold standard is to reproduce Taskonomy's affinity matrix at lower cost. I therefore see the comparison between the attribution-map transferability matrix and Taskonomy's (Fig 4) as valid and as the main point. But I don't understand why or how the comparison between the SVCCA and attribution-map similarity matrices (Fig 3) is useful. What exactly is the value of the SVCCA-based similarity matrix? Why doesn't Fig 3 compare the attribution-map matrix with Taskonomy's affinity matrix (after it has been made symmetric)? As I said, the practical value of task similarity has been shown for the Taskonomy affinity matrix (Fig 7 of the Taskonomy paper), so it makes sense to aspire to reproduce that, regardless of SVCCA. In this regard, the paper in L225-228 raises a question to justify comparing against SVCCA: whether similarity between attribution maps correlates with similarity between the representations of a neural network. But, as I said, an absolute similarity between neural network representations doesn't seem to have any practical value unless that similarity is shown to imply transferability (which is what the Taskonomy affinity matrix does). So why this assumption is relevant to evaluate, beyond the comparison with Taskonomy's affinity matrix, is unclear to me.

2) Related to the above point, the paper seems to suggest that attribution maps and SVCCA ultimately yield similar task similarity matrices (Fig 3). Why, then, do the authors believe attribution maps are a novel method worth publishing, if the final outcome is the same as SVCCA's? Like the proposed method, SVCCA doesn't need separate labeled data, so supervision is not the advantage. If compute is the advantage, then the paper should report clearly how much more efficient attribution maps are (though I don't find computational efficiency alone an exciting advantage, at least compared to not needing labeled data). Overall, I think the role of SVCCA should be clarified in this submission.

3) The proposed method strictly results in a symmetric task similarity matrix (eq 1 and L174). This seems like a strong constraint and limitation, as task transferability is not a symmetric property (i.e. if A transfers well to B, that doesn't mean B will transfer well to A; see Fig 7 of the Taskonomy paper). This makes sense when thinking of task transferability in an information-theoretic manner. However, I'm surprised that Fig 4 shows that the symmetric task similarity matrix from attribution maps can be a good predictor of Taskonomy's asymmetric transferability relationships. Are Taskonomy's top-k sources retrieved as-is from Taskonomy's affinity matrix, or is the matrix forced to be symmetric beforehand? Overall, how limiting is the symmetry constraint (ideally reported quantitatively)?

4) I didn't find the attribution maps qualitatively intuitive (Fig 2). The attended areas of the image don't seem to be related to the actual task (e.g. in 3D tasks the pixels that matter for 3D don't seem to be attended), or there are clusters of attended pixels without a clear semantic meaning behind them. However, the resulting analysis using the attribution maps seems to work (sec 4.2), so quantitative value seems to exist. But at this stage I fail to spot a qualitative value.

5) Related to point 1 above, the analysis in sec 4.3 is more like a curious experiment and an intuitive evaluation of the trends. As SVCCA (i.e. just similarity between representations) doesn't necessarily imply transferability, the trends in section 4.3 do not necessarily have practical value. I still think keeping them is better than removing them, but I would clarify that the observed trends don't necessarily have conclusive practical value.

6) Per Fig 4, it seems that the probe datasets other than Taskonomy predict the Taskonomy transferability better than the probe dataset based on Taskonomy itself. This seems counterintuitive. How do you justify this?

7) The paper should cite and compare with recent works that also attempt to reproduce Taskonomy's affinity matrix with cheaper methods, e.g. "Representation Similarity Analysis for Efficient Task taxonomy & Transfer Learning", CVPR19.
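
One way the quantitative report requested in point 3 could look is sketched below. This is a hypothetical illustration, not an analysis from the paper or from Taskonomy: it assumes a square affinity matrix with `affinity[i, j]` scoring transfer from source task `i` to target task `j`, and the function names are invented for this sketch. It measures how asymmetric a reference matrix is and how well any score matrix (for example, a symmetric one) recovers the reference's top-k sources per target.

```python
import numpy as np


def asymmetry_ratio(affinity):
    """Share of the matrix's Frobenius norm in its antisymmetric part.

    Returns 0.0 for a perfectly symmetric matrix and 1.0 for a purely
    antisymmetric one.
    """
    a = np.asarray(affinity, dtype=float)
    total = np.linalg.norm(a)
    if total == 0.0:
        return 0.0
    skew = 0.5 * (a - a.T)
    return float(np.linalg.norm(skew) / total)


def topk_sources(scores, target, k):
    """Indices of the k best source tasks for one target (column)."""
    col = np.asarray(scores, dtype=float)[:, target].copy()
    col[target] = -np.inf  # a task is not counted as its own source
    return set(np.argsort(-col)[:k].tolist())


def mean_topk_overlap(predicted, reference, k):
    """Average fraction of the reference top-k sources that are recovered."""
    n = np.asarray(reference).shape[0]
    if not 1 <= k <= n - 1:
        raise ValueError("k must be between 1 and n_tasks - 1")
    overlaps = [
        len(topk_sources(predicted, t, k) & topk_sources(reference, t, k)) / k
        for t in range(n)
    ]
    return float(np.mean(overlaps))
```

Comparing `mean_topk_overlap(0.5 * (A + A.T), A, k)` against the overlap achieved by the attribution-map matrix would indicate how much of any gap is attributable to symmetry alone, under the stated assumptions about the matrix layout.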