{"title": "Deep Model Transferability from Attribution Maps", "book": "Advances in Neural Information Processing Systems", "page_first": 6182, "page_last": 6192, "abstract": "Exploring the transferability between heterogeneous tasks sheds light on their intrinsic interconnections, and consequently enables knowledge transfer from one task to another so as to reduce the training effort of the latter. In this paper, we propose an embarrassingly simple yet very efficacious approach to estimating the transferability of deep networks, especially those handling vision tasks. Unlike the seminal work of \\emph{taskonomy} that relies on a large number of annotations as supervision and is thus computationally cumbersome, the proposed approach requires no human annotations and imposes no constraints on the architectures of the networks. This is achieved, specifically, via projecting deep networks into a \\emph{model space}, wherein each network is treated as a point and the distances between two points are measured by deviations of their produced attribution maps. The proposed approach is several-magnitude times faster than taskonomy, and meanwhile preserves a task-wise topological structure highly similar to the one obtained by taskonomy. Code is available at \\url{https://github.com/zju-vipa/TransferbilityFromAttributionMaps}.", "full_text": "Deep Model Transferability from Attribution Maps\n\nJie Song1,3, Yixin Chen1, Xinchao Wang2, Chengchao Shen1, Mingli Song1,3\n\n1Zhejiang University, 2Stevens Institute of Technology\n\n3Alibaba-Zhejiang University Joint Institute of Frontier Technologies\n\n{sjie,chenyix,chengchaoshen,brooksong}@zju.edu.cn\n\nxinchao.wang@stevens.edu\n\nAbstract\n\nExploring the transferability between heterogeneous tasks sheds light on their\nintrinsic interconnections, and consequently enables knowledge transfer from one\ntask to another so as to reduce the training effort of the latter. 
In this paper, we propose an embarrassingly simple yet very efficacious approach to estimating the transferability of deep networks, especially those handling vision tasks. Unlike the seminal work of taskonomy, which relies on a large number of annotations as supervision and is thus computationally cumbersome, the proposed approach requires no human annotations and imposes no constraints on the architectures of the networks. This is achieved, specifically, via projecting deep networks into a model space, wherein each network is treated as a point and the distances between two points are measured by deviations of their produced attribution maps. The proposed approach is several orders of magnitude faster than taskonomy, and meanwhile preserves a task-wise topological structure highly similar to the one obtained by taskonomy. Code is available at https://github.com/zju-vipa/TransferbilityFromAttributionMaps.

1 Introduction

Deep learning has brought about unprecedented advances in many if not all the major artificial intelligence tasks, especially computer vision ones. The state-of-the-art performances, however, come at the cost of an often burdensome training process that requires an enormous number of human annotations and GPU hours, as well as black-box behaviors that are only partially interpretable and thus only intermittently predictable. Understanding the intrinsic relationships between such deep-learning tasks, if any, may on the one hand elucidate the rationale of the encouraging results achieved by deep learning, and on the other hand allow for more predictable and explainable transfer learning from one task to another, so that the training effort can be significantly reduced.

The seminal work of taskonomy [37] made the pioneering attempt towards disentangling the relationships between visual tasks through a computational approach. 
This is accomplished by first training all the task models and then all the feasible transfers among models, in a fully supervised manner. Based on the obtained transfer performances, an affinity matrix of transferability is derived, upon which an Integer Program can be further imposed to compute the final budget-constrained task-transferability graph. Despite the intriguing results achieved, the training cost, especially that for the combinatorial-based transferability learning, makes taskonomy prohibitively expensive to estimate. Even for the first-order transferability estimation, the training costs grow quadratically with respect to the number of tasks involved; when adding a new task to the graph, the transferability has to be explicitly trained between the new task and all those in the task dictionary.

In this paper, we propose an embarrassingly simple yet competent approach to estimating the transferability between different tasks, with a focus on computer vision ones. Unlike taskonomy, which relies on training the task-specific models and their transferability using human annotations, in

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

our approach we assume no labelled data are available, and we are given only the pre-trained deep networks, which can nowadays be found effortlessly online. Moreover, we do not impose constraints on the architectures of the deep networks, such as requiring networks handling different tasks to share the same architecture.

At the heart of our approach is to project pre-trained deep networks into a common space, termed the model space. The model space accepts networks of heterogeneous architectures handling different tasks, and transforms each network into a point. The distance between two points in the model space is then taken to be the measure of their relatedness and the consequent transferability. 
Such construction of the model space enables prompt model insertion or deletion, as updating the transferability graph boils down to computing nearest neighbors in the model space, which is therefore much lighter than taskonomy, which requires pair-wise re-training for each newly added task.

The projection into the model space is attained by feeding unlabelled images, which can be obtained handily online, into a network and then computing the corresponding attribution maps. An attribution map signals the pixels in the input image highly relevant to the downstream tasks or hidden representations, and therefore highlights the “attention” of a network over a specific task. In other words, the model space can be thought of as a space defined on top of attribution maps, where the affinity between points or networks is evaluated using the distance between their produced attribution maps, which, again, requires no supervision and can be computed very quickly.

The intuition behind adopting attribution maps for network-affinity estimation is rather straightforward: models focusing on similar regions of input images are expected to produce correlated representations, and thus potentially give rise to favorable transfer-learning results. This assumption is inspired by the work of [36], which utilizes the attention of a teacher model to guide the learning of a student and produces encouraging results. Despite its very simple nature, the proposed approach yields truly promising results: it leads to a speedup of several orders of magnitude and meanwhile maintains a highly similar transferability topology, as compared to taskonomy. 
In addition, experiments on vision tasks beyond those involved in taskonomy also produce intuitively plausible results, validating the proposed approach and providing us with insights on their transferability.

Our contribution is therefore a lightweight and effective approach towards estimating the transferability between deep visual models, achieved via projecting each model into a common space and approximating their affinity using attribution maps. It requires no human annotations and is readily applicable to pre-trained networks specializing in various tasks and of heterogeneous architectures. Running several orders of magnitude faster than taskonomy and producing competitively similar results, the proposed model may serve as a competent transferability estimator and an effectual substitute for taskonomy, especially when human annotations are unavailable, when the model library is large in size, or when frequent model insertion or update takes place.

2 Related Work

We briefly review here some topics that are most related to the proposed work, including model reusing, transfer learning, and attribution methods for deep models.

Model Reusing. Reusing pre-trained models has been an active research topic in recent years. Hinton et al. [9] first proposed the concept of “knowledge distillation”, where trained cumbersome teacher models are reused to produce soft labels for training a lightweight student model. Following their teacher-student scheme, some more advanced methods [24, 36, 6, 15] have been proposed to fully exploit the knowledge encoded in the trained teacher model. However, in these works all the teachers and the student are trained for the same task. To reuse models of different tasks, Rusu et al. [25] propose the progressive neural net to extract useful features from multiple teachers for a new task. Parisotto et al. 
[19] propose “Actor-Mimic” to use the guidance from several expert teachers of distinct tasks. However, none of these works explores the relatedness among different tasks. In this paper, by explicitly modeling the model transferability, we provide an effective method to pick the trained model most beneficial for solving the target task.

Transfer Learning. Another way of reusing trained models is to transfer the trained model to another task by reusing the features extracted from certain layers. Razavian et al. [22] demonstrated that features extracted from deep neural networks could be used as generic image representations to tackle a diverse range of visual tasks. Yosinski et al. [34] investigated the transferability of deep features extracted from every layer of a deep neural network. Azizpour et al. [2] investigated several factors affecting the transferability of deep features. Recently, the effects of pre-training datasets for transfer learning have been studied [13, 7, 12, 33, 23]. None of these works, however, explicitly quantifies the relatedness among different tasks or trained models to provide a principled way for model selection. Zamir et al. [37] proposed a fully computational approach, known as taskonomy, to address this challenging problem. However, taskonomy requires labeled data and is computationally expensive, which limits its applications in large-scale real-world problems. Recently, Dwivedi and Roig [4] proposed to use representation similarity analysis to approximate the task taxonomy. 

Figure 1: An illustrative diagram of the workflow of the proposed method. It mainly consists of three steps: collecting probe data, computing attribution maps, and estimating model transferability.
In this paper, we introduce a model space for modeling task transferability and propose to measure the transferability via attribution maps, which, unlike taskonomy, requires no human annotations and works directly on pre-trained models. We believe our method is a good complement to existing works.

Attribution Methods for Deep Models. Attribution refers to assigning importance scores to the inputs for a specified output. Existing attribution methods can be mainly divided into two groups: perturbation-based [38, 39, 40] and gradient-based [28, 3, 27, 30, 26, 18, 1] methods. Perturbation-based methods compute the attribution of an input feature by making perturbations, e.g., removing, masking or altering, to individual inputs or neurons and observing the impact on later neurons. However, such methods are computationally inefficient as each perturbation requires a separate forward propagation through the network. Gradient-based methods, on the other hand, estimate the attributions for all input features in one or a few forward and backward passes through the network, which renders them generally more efficient. Simonyan et al. [28] construct attributions by taking the absolute value of the partial derivative of the target output with respect to the input features. Later, Layer-wise Relevance Propagation (ε-LRP) [3], gradient*input [27], integrated gradients [30] and DeepLIFT [26] were proposed to aid understanding of the information flow of deep neural networks. In this paper, we directly adopt some of these off-the-shelf methods to produce the attribution maps. Devising a more suitable attribution method for our problem is left to future work.

3 Estimating Model Transferability from Attribution Maps

We provide in this section the details of the proposed transferability estimator. 
We start by giving the problem setup and an overview of the method, followed by a description of its three steps, and finally an efficiency analysis.

3.1 Problem Setup

Assume we are given a set of pre-trained deep models M = {m1, m2, ..., mN}, where N is the total number of models involved. No constraints are imposed on the architectures of these models. We use ti to denote the task handled by model mi, and use T = {t1, t2, ..., tN} to denote the task dictionary, i.e., the set of all the tasks involved in M. Furthermore, we assume that no labeled annotations are available. Our goal is to efficiently quantify the transferability between different tasks in T, so that given a target task, we can read out from the learned transferability matrix the source task that potentially yields the highest transfer performance.
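This read-out step can be sketched in a few lines; the task names and scores below are made-up illustrations, not the paper's data:

```python
import numpy as np

def best_source(transfer_matrix, target, tasks):
    """Read the most promising source task for `target` off the matrix.

    transfer_matrix[i, j]: estimated transferability from source task i
    to target task j (higher is better).
    """
    j = tasks.index(target)
    scores = transfer_matrix[:, j].copy()
    scores[j] = -np.inf  # a task is not its own source
    return tasks[int(np.argmax(scores))]

# Hypothetical 3-task dictionary with made-up transferability scores.
tasks = ["autoencoder", "denoise", "edge2d"]
T = np.array([[0.0, 0.9, 0.4],
              [0.8, 0.0, 0.5],
              [0.3, 0.6, 0.0]])
print(best_source(T, "denoise", tasks))  # -> autoencoder
```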
3.2 Overview

The core idea of our method is to embed the pre-trained deep models into the model space, wherein models are represented by points and model transferability is measured by the distance between corresponding points. To this end, we utilize attribution maps to construct such a model space. The assumption is that related models should produce similar attribution maps for the same input image. The workflow of our method consists of three steps, as shown in Figure 1. First, we collect an unlabeled probe dataset, which will be used to construct the model space, from a randomly selected data distribution. 
Second, for each trained model, we adopt off-the-shelf attribution methods to compute the attribution maps of all images in the constructed probe dataset. Finally, for each model, all its attribution maps are collectively viewed as a single point in the model space, based on which the model transferability is estimated. In what follows, we provide details for each of the three steps.

3.3 Key Steps

Step 1: Building the Probe Dataset. As deep models handling different tasks, or even the same one, may be of heterogeneous architectures or trained on data from various domains, it is non-trivial to measure their transferability directly from their outputs or intermediate features. To bypass this problem, we feed the same input images to these models and measure the model transferability by the similarity of their responses to the same stimuli. We term the set of all such input images the probe data, which is shared by all the tasks involved.

Intuitively, the probe dataset should be designed not only large in size but also rich in diversity, as models in M may be trained on various domains for different tasks. However, experiments show that the proposed method works surprisingly well even when the probe data are collected in a single domain and of moderately small size (~1,000 images). The produced transferability relationship is highly similar to the one derived by taskonomy. This property renders the proposed method attractive, as little effort is required for collecting the probe data. More details can be found in Section 4.2.3.

Step 2: Computing Attribution Maps. Let us denote the collected probe data by X = {X1, X2, ..., XNp}, where Xi = [x^i_1, x^i_2, ..., x^i_{WHC}] ∈ R^{WHC}, W, H and C respectively denote the width, the height and the channels of the input images, and Np is the size of the probe data. Note that for brevity the maps are symbolized in vectorized form here.
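The vectorized notation above corresponds to simply stacking flattened probe images; a minimal numpy sketch with toy sizes (the actual probe set uses roughly 1,000 real images):

```python
import numpy as np

def build_probe_data(images):
    """Vectorize Np probe images of shape (W, H, C) into X of shape (Np, W*H*C)."""
    return np.stack([img.reshape(-1) for img in images])

rng = np.random.default_rng(0)
imgs = [rng.random((4, 4, 3)) for _ in range(5)]  # toy stand-in for probe images
X = build_probe_data(imgs)
print(X.shape)  # (5, 48)
```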
For model mi, it takes an input X̃ = Ti(X) ∈ R^{Wi Hi Ci} and produces a hidden representation R = [r_1, r_2, ..., r_D]. Here, Ti serves as a preprocessing function that transforms the images in the probe data for model mi, as we allow different models to take images of different sizes as input, and D is the dimension of the representation. For each model mi in M, our goal in this step is to produce an attribution map A^i_j = [a^i_{j1}, a^i_{j2}, ...] ∈ R^{WHC} for each image Xj in the probe data X.

In fact, an attribution map A^{i,k}_j can be computed for each unit r_k in R. However, as we consider the transferability of R, we average the attribution maps of all r in R as the overall attribution map of R. Formally, we have A^i_j = (1/D) Σ^D_{k=1} A^{i,k}_j. Specifically, here we adopt three off-the-shelf attribution methods to produce the attribution maps: saliency map [28], gradient * input [27], and ε-LRP [3]. Saliency map computes attributions by taking the absolute value of the partial derivative of the target output with respect to the input. Gradient * input refers to a first-order Taylor approximation of how the output would change if the input were set to zero. ε-LRP, on the other hand, computes the attributions by redistributing the prediction score (output) layer by layer until the input layer is reached. For all three attribution methods, the overall attribution map A^i_j can be computed through one single forward-and-backward propagation [1] in TensorFlow. The formulations of the three attribution maps are summarized in Table 1. More details can be found in [28, 27, 3, 1].

Table 1: Mathematical formulations of saliency map [28], gradient * input [27] and ε-LRP [3]. 
Note that the superscript g denotes a novel definition of partial derivative [1].

  Saliency Map [28]:      Ã^{i,k}_j = [ |∂r_k / ∂x^j_d| ]^{Wi Hi Ci}_{d=1}
  Gradient * Input [27]:  Ã^{i,k}_j = [ x^j_d · ∂r_k / ∂x^j_d ]^{Wi Hi Ci}_{d=1}
  ε-LRP [3, 1]:           Ã^{i,k}_j = [ x^j_d · ∂^g r_k / ∂x^j_d ]^{Wi Hi Ci}_{d=1}, with g = f(z)/z

For model mi, the produced attribution map Ã^i is of the same size as the input X̃, i.e., Ã^i ∈ R^{Wi Hi Ci}. We apply the inverse of T to transform the attribution maps back to the same size as the images in the probe data: A^i = T^{-1}(Ã^i), A^i ∈ R^{WHC}. As the attribution maps of all models are transformed into the same size, the transferability can be computed based on these maps.

Step 3: Estimating Model Transferability. Once step 2 is completed, we have Np attribution maps A^i = {A^i_1, A^i_2, ..., A^i_{Np}} for each model mi, where A^i_j denotes the attribution map of the j-th image Xj in X. The model mi can then be viewed as a sample in the model space R^{Np·WHC}, formed by concatenating all the attribution maps. The distance between two models is taken to be

  d(mi, mj) = (1/Np) Σ^{Np}_{k=1} cos_sim(A^i_k, A^j_k),   (1)

where cos_sim(A^i_k, A^j_k) = (A^i_k · A^j_k) / (‖A^i_k‖ · ‖A^j_k‖). The model transferability map, which measures the pairwise transferability relationships, can then be derived based on these distances. The model transferability, as shown by taskonomy [37], is inherently asymmetric. In other words, if model mi ranks first in being transferred to task tj among all the models (except mj) in M, mj does not necessarily rank first in being transferred to task ti. Yet, the proposed model space is symmetric in distance, as we have d(mi, mj) = d(mj, mi). 
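Putting Steps 2 and 3 together, here is a minimal numpy sketch. It assumes hypothetical linear encoders r = W·x, for which the derivatives in Table 1 have closed forms; this is an illustration under that assumption, not the paper's TensorFlow implementation. It computes per-unit gradient * input maps, averages them over the D units, and evaluates the Eq. (1) measure as the mean cosine similarity over probe images:

```python
import numpy as np

def gradient_x_input_maps(W, x):
    """Per-unit gradient*input maps for a toy linear encoder r = W @ x.

    For a linear model, dr_k/dx_d = W[k, d], so the map for unit k is
    x * W[k, :]; one row per hidden unit, shape (D, WHC)."""
    return W * x[None, :]

def overall_attribution(per_unit):
    """A^i_j = (1/D) sum_k A^{i,k}_j: average the per-unit maps over D units."""
    return per_unit.mean(axis=0)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def model_distance(A_i, A_j):
    """Eq. (1): mean cosine similarity of attribution maps over Np probe images."""
    return float(np.mean([cos_sim(a, b) for a, b in zip(A_i, A_j)]))

rng = np.random.default_rng(0)
probe = rng.normal(size=(5, 6))                              # Np=5 images, WHC=6
W_a, W_b = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))  # two D=4 encoders
A_a = np.stack([overall_attribution(gradient_x_input_maps(W_a, x)) for x in probe])
A_b = np.stack([overall_attribution(gradient_x_input_maps(W_b, x)) for x in probe])
print(round(model_distance(A_a, A_a), 4))                    # self-similarity: 1.0
print(model_distance(A_a, A_b) == model_distance(A_b, A_a))  # symmetric: True
```

Note that the measure is symmetric by construction, mirroring the d(mi, mj) = d(mj, mi) property discussed above.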
We argue that the symmetric property of the distance in the model space has little negative effect on the transferability relationships, as the task transferability rankings of the source tasks are computed by relative comparison of distances. Experiments demonstrate that with the symmetric model space, the proposed method is able to effectively approximate the asymmetric transferability relationships produced by taskonomy.

3.4 Efficiency Analysis

Here we make a rough comparison between the efficiency of the proposed approach and that of taskonomy. As we assume task-specific trained models are available, we compare the computation cost of our method with that of only the transfer modeling in taskonomy. For taskonomy, let us assume the transfer model is trained for E epochs on training data of size N; then for a task dictionary of size T, the computation cost can be approximately denoted as ENT(T − 1) times forward-and-backward propagation1. For our method working on the probe dataset, however, only one forward-and-backward propagation is required per model and image. The overall computation cost for building the model space in our method is about TM times forward-and-backward propagation, where M is the size of the probe dataset and usually M ≪ N. The proposed method is thus about EN(T − 1)/M times more efficient than taskonomy. This also means the speedup over taskonomy will be even more significant if more tasks are involved and hence T enlarges.

In our experiments, the proposed method takes about 20 GPU hours to compute the pairwise transferability relationships on one Quadro P5000 card for 20 pre-trained taskonomy models, while taskonomy takes thousands of GPU hours on the cloud2 for the same number of tasks.

4 Experiments

4.1 Experimental Settings

Pre-trained Models. 
Two groups of trained models are adopted to validate the proposed method. In the first group, we adopt 20 trained models of single-image tasks released by taskonomy [37], of which the task relatedness has been constructed and also released. It is used as the oracle to evaluate the proposed method. Note that all these models adopt an encoder-decoder architecture, where the encoder is used to extract representations and the decoder makes task predictions. For these models, the attribution maps are computed with respect to the output of the encoder.

To further validate the proposed method, we construct a second group of trained models which are collected online. We have managed to obtain 18 trained models in this group: two VGGs [29] (VGG16, VGG19), three ResNets [8] (ResNet50, ResNet101, ResNet152), two Inceptions (Inception V3 [32], Inception ResNet V2 [31]), three MobileNets [10] (MobileNet, 0.5 MobileNet, 0.25 MobileNet), four Inpaintings [35] (ImageNet, CelebA, CelebA-HQ, Places), FCRN [14], FCN [17], PRN [5] and Tiny Face Detector [11]. All these models are also viewed in an encoder-decoder architecture: the sub-model which produces the most compact features is viewed as the encoder and the remainder as the decoder. Similar to taskonomy models, the attribution maps are computed with respect to the output of the encoder. More details of these models can be found in the supplementary material.

1Here for simplicity, we ignore the computation-cost difference caused by the model architectures.

2As the hardware configurations are not clear here, we list the GPU hours only for perceptual comparison.

Figure 2: Visualization of attribution maps produced using ε-LRP on taskonomy models. Some tasks produce visually similar attribution maps, such as Rgb2depth and Rgb2mist.

Probe Datasets. We build three datasets, taskonomy data [37], indoor scene [20], and COCO [16], as the probe data to evaluate our method. 
The domain difference between taskonomy data and COCO is much larger than that between taskonomy data and indoor scene. For all three datasets, we randomly select about 1,000 images to construct the probe datasets. More details of the three probe datasets are provided in the supplementary material. In Section 4.2.3, we demonstrate the performances of the proposed method evaluated on these three probe datasets.

4.2 Experiments on Models in Taskonomy

4.2.1 Visualization of Attribution Maps

We first visualize the attribution maps produced by various trained models for the same input images. Two examples are given in Figure 2. Attribution maps are produced by ε-LRP on taskonomy data. From the two examples, we can see that some tasks produce visually similar attribution maps, for example, ⟨Rgb2depth, Rgb2mist⟩3, ⟨Class 1000, Class Places⟩ and ⟨Denoise, Keypoint 2D⟩. In each cluster, the trained models pay their “attention” to similar regions, thus the “knowledge” they have learned is intuitively highly correlated (as seen in Section 4.2.2) and can be transferred between them (as seen in Section 4.2.3). Individual examples may suggest transferability relationships that deviate from the underlying model relatedness. However, such deviation is alleviated by aggregating the results of more examples drawn from the data distribution. For more visualization examples, please see the supplementary material.

4.2.2 Rationality of the Assumption

Here we adopt Singular Vector Canonical Correlation Analysis (SVCCA) [21] to validate the rationality underlying our assumption: if tasks produce similar attribution maps, the representations extracted from the corresponding models should be highly correlated, and thus they are expected to yield favorable transfer-learning performance to each other. 
In SVCCA, each neuron is represented by an activation vector, i.e., its set of responses to a set of inputs, and hence a layer can be represented by the subspace spanned by the activation vectors of all the neurons in this layer. SVCCA first adopts Singular Value Decomposition (SVD) of each subspace to obtain new subspaces that comprise the most important directions of the original subspaces, and then uses Canonical Correlation Analysis (CCA) to compute a series of correlation coefficients between the new subspaces. The overall correlation is measured by the average of these correlation coefficients.

3Here we use ⟨⟩ to denote a cluster of tasks, of which the attribution maps are highly similar.

Figure 3: Left: visualization of the correlation matrix from SVCCA. Middle: the difference between the correlation matrix from SVCCA and the transferability matrix derived from attribution maps. Both of them are normalized for better visualization. Right: the Correlation-Priority Curve (CPC).

Experimental results on taskonomy data with ε-LRP are shown in Figure 3. On the left, the correlation matrix over the pre-trained taskonomy models is visualized. In the middle, we plot the difference between the correlation matrix and the model transferability matrix derived from attribution maps in the proposed method. It can be seen that the values in the difference matrix are in general small, implying that the correlation matrix is highly similar to the model transferability matrix. To further quantify the overall similarity between these two matrices, we compute their Pearson correlation (ρp = 0.939) and Spearman correlation (ρs = 0.660). 
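A simplified SVCCA can be sketched with plain numpy: SVD to keep the leading directions of each layer, then canonical correlations via the standard QR-based method. This is a sketch under those assumptions, not the reference implementation of [21]:

```python
import numpy as np

def svcca_similarity(X, Y, var_kept=0.99):
    """Simplified SVCCA between two layers.

    X, Y: activation matrices of shape (n_inputs, n_neurons), one row per
    probe input. Returns the mean canonical correlation between the top
    SVD subspaces of the two layers.
    """
    def top_subspace(Z):
        Z = Z - Z.mean(axis=0)
        U, s, _ = np.linalg.svd(Z, full_matrices=False)
        # keep the leading directions explaining `var_kept` of the variance
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), var_kept)) + 1
        return U[:, :k] * s[:k]

    Qx, _ = np.linalg.qr(top_subspace(X))
    Qy, _ = np.linalg.qr(top_subspace(Y))
    # canonical correlations = singular values of Qx^T Qy
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
print(round(svcca_similarity(X, X), 4))  # identical layers -> 1.0
```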
All these results show that the similarity of attribution maps is a good indicator of the correlation between representations.

In addition, we can see that some tasks, like Edge3d and Colorization, tend to be more correlated to other tasks, as the colors of the corresponding row or column are darker than those of others, while some other tasks are not, like Vanishing Point. In taskonomy, the priorities4 of Edge3d, Colorization and Vanishing Point are 5.4, 5.8 and 14.2, respectively. This indicates that more correlated representations tend to be more suitable for transfer learning to each other. To make this clearer, we depict the Correlation-Priority Curve (CPC) in the right of Figure 3. In this figure, for each priority p shown on the abscissa, the correlation shown on the ordinate is computed as correlation(p) = (1/N) Σ_{i≠j} I(r^i_j = p) ρ_{i,j}, where I is the indicator function and ρ_{i,j} is the correlation between representations extracted from two models mi and mj. It can be seen that as the priority becomes lower, the average correlation becomes weaker. All these results verify the rationality underlying the assumption.

4.2.3 Deep Model Transferability

We adopt two evaluation metrics, P@K and R@K5, which are widely used in the information retrieval field, to compare the model transferability constructed from our method with that from taskonomy. Each target task is viewed as a query, and its top-5 source tasks that produce the best transferring performances in taskonomy are regarded as relevant to the query. To better understand the results, we introduce one baseline using random ranking, and the oracle, the ideal method which always produces the perfect results. Additionally, we also evaluate SVCCA for computing the model transferability relationships. The experimental results are depicted in Figure 4. 
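The CPC computation described in Section 4.2.2 can be sketched as follows; the rankings and correlations in this N = 3 example are made up for illustration:

```python
import numpy as np

def correlation_priority_curve(rank, rho):
    """CPC: correlation(p) = (1/N) * sum_{i != j} I(rank[i, j] == p) * rho[i, j].

    rank[i, j]: ranking of source task i when transferred to target task j.
    rho[i, j]:  representation correlation between models i and j.
    """
    N = rank.shape[0]
    curve = {}
    for p in range(1, N):  # possible priorities among the N-1 source tasks
        mask = (rank == p) & ~np.eye(N, dtype=bool)
        curve[p] = float(rho[mask].sum() / N)
    return curve

# Tiny illustrative example with N = 3 tasks (made-up values).
rank = np.array([[0, 1, 2],
                 [1, 0, 1],
                 [2, 2, 0]])
rho = np.array([[1.0, 0.8, 0.3],
                [0.8, 1.0, 0.5],
                [0.3, 0.5, 1.0]])
print(correlation_priority_curve(rank, rho))
```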
Based on the results, we can make the following conclusions.

• The topology structure of the model transferability derived from the proposed method is similar to that of the oracle. For example, when only the top-3 predictions are examined, the precision can be about 85% on COCO with ε-LRP. To see this more clearly, we also depict the task similarity tree constructed

4The priority of a task i refers to its average ranking when transferred to other tasks: pi = (1/N) Σ_j r^i_j, where r^i_j denotes the ranking of task i when transferred to task j. A smaller value of p denotes a higher priority.

5P: precision, R: recall, @K: only the top-K results are examined.
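P@K and R@K as used here can be computed per query as follows; the task names below are hypothetical stand-ins, not results from the paper:

```python
def precision_recall_at_k(predicted_ranking, relevant, k):
    """P@K and R@K for one query (target task).

    predicted_ranking: source tasks ordered by our estimated transferability.
    relevant: the top-5 source tasks from taskonomy, treated as ground truth.
    """
    top_k = set(predicted_ranking[:k])
    hits = len(top_k & set(relevant))
    return hits / k, hits / len(relevant)

pred = ["denoise", "keypoint2d", "edge2d", "autoenc", "rgb2depth", "class1000"]
gt = ["denoise", "edge2d", "inpainting", "keypoint2d", "autoenc"]
print(precision_recall_at_k(pred, gt, 3))  # -> (1.0, 0.6)
```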
Whole1.00.20.80.30.20.50.20.40.10.10.10.20.10.20.40.10.20.10.10.60.21.00.20.30.50.20.50.30.40.30.40.40.10.50.30.10.30.10.10.20.80.21.00.30.20.60.20.40.20.10.10.20.10.20.40.10.20.10.10.70.30.30.31.00.30.20.30.20.20.10.10.20.10.30.30.10.20.10.10.20.20.50.20.31.00.10.40.20.40.40.40.40.20.50.30.10.30.20.20.20.50.20.60.20.11.00.20.30.10.10.10.10.10.20.30.10.20.10.10.40.20.50.20.30.40.21.00.30.40.30.40.50.20.50.30.10.30.20.20.20.40.30.40.20.20.30.31.00.20.20.20.20.20.30.40.10.20.20.20.30.10.40.20.20.40.10.40.21.00.50.40.40.20.40.20.10.30.20.20.10.10.30.10.10.40.10.30.20.51.00.60.40.20.40.20.20.30.20.20.10.10.40.10.10.40.10.40.20.40.61.00.40.20.40.20.20.30.20.20.10.20.40.20.20.40.10.50.20.40.40.41.00.30.50.30.10.30.20.20.20.10.10.10.10.20.10.20.20.20.20.20.31.00.30.20.30.30.40.40.10.20.50.20.30.50.20.50.30.40.40.40.50.31.00.50.30.40.30.30.20.40.30.40.30.30.30.30.40.20.20.20.30.20.51.00.20.30.20.20.30.10.10.10.10.10.10.10.10.10.20.20.10.30.30.21.00.20.30.30.10.20.30.20.20.30.20.30.20.30.30.30.30.30.40.30.21.00.30.30.20.10.10.10.10.20.10.20.20.20.20.20.20.40.30.20.30.31.00.40.10.10.10.10.10.20.10.20.20.20.20.20.20.40.30.20.30.30.41.00.10.60.20.70.20.20.40.20.30.10.10.10.20.10.20.30.10.20.10.11.00.00.20.40.60.81.0AutoencoderCurvatureDenoiseEdge 2DEdge 3DKeypoint 2DKeypoint 3DColorizationReshadeRgb2depthRgb2mistRgb2sfnormRoom LayoutSegment 25DSegment 2D Vanishing PointSegment SemanticClass 1000Class PlacesInpainting WholeAutoencoderCurvatureDenoiseEdge 2DEdge 3DKeypoint 2DKeypoint 3DColorizationReshadeRgb2depthRgb2mistRgb2sfnormRoom LayoutSegment 25DSegment 2D Vanishing PointSegment SemanticClass 1000Class PlacesInpainting 
Whole0.00.10.00.10.00.10.10.00.00.00.00.10.00.00.10.10.10.20.20.10.10.00.10.10.00.10.40.00.00.10.30.00.00.00.10.00.10.20.20.10.00.10.00.20.00.00.10.00.10.00.10.00.00.10.10.10.00.20.20.10.10.10.20.00.10.30.20.00.00.00.00.00.10.10.00.10.00.20.20.10.00.00.00.10.00.00.10.00.00.10.00.00.00.10.00.00.10.10.10.00.10.10.00.30.00.00.10.00.10.10.00.00.00.00.10.10.00.10.00.10.10.40.10.20.10.10.00.10.00.10.00.10.10.30.10.10.00.10.10.10.00.00.00.00.00.00.10.00.00.00.00.10.10.10.20.00.00.10.20.00.00.00.10.00.00.10.00.00.00.10.10.10.10.10.00.00.10.00.10.00.00.10.00.00.10.10.10.00.10.00.10.10.00.00.00.00.10.10.10.10.00.30.10.00.00.00.00.00.10.10.00.10.10.10.00.00.10.10.10.10.10.00.00.00.00.00.10.10.10.10.10.00.10.10.10.00.00.10.10.00.00.00.00.10.00.00.10.10.10.00.10.10.00.20.10.20.00.30.40.00.00.00.10.10.10.00.30.10.10.00.10.10.20.00.20.10.10.10.10.00.10.10.10.00.00.10.10.20.00.00.00.10.10.20.00.00.10.10.00.10.10.00.10.10.00.10.10.00.00.00.00.00.20.10.00.00.00.10.20.10.10.10.00.00.10.00.00.00.10.10.10.00.00.10.10.00.00.00.20.10.20.20.20.20.10.10.10.10.00.10.10.10.30.10.10.10.00.00.00.10.20.20.20.20.10.00.10.20.10.10.10.10.40.10.00.20.20.00.00.10.10.10.10.10.00.10.10.00.00.10.10.00.00.00.10.10.10.10.10.00.00.20.40.60.81.012345678910111213141516171819Priority0.00.10.20.30.40.50.60.70.80.9Correlation0.770.710.760.760.730.710.680.590.560.470.460.440.460.460.480.420.420.420.31\fFigure 4: From left to right: P@K curve, R@K curve and task similarity tree constructed by \u0001-LRP.\nResults of SVCCA are produced using validation data from taskonomy.\n\nby agglomerative hierarchical clustering in Figure 4. This tree is again highly similar to that of\ntaskonomy where 3D, 2D, geometric, and semantic tasks cluster together.\n\u2022 \u0001-LRP and gradient* input generally produce better performance than saliency. This phenomenon\ncan be in part explained by the fact that saliency generates attributions entirely based on gradients\nthat denote the direction for optimization. 
However, gradients alone are not able to fully reflect the relevance between the inputs and the outputs of the deep model, thus leading to inferior results. This also implies that the choice of attribution method affects the performance of our approach; devising better attribution methods may further improve its accuracy, which is left as future work.
• The proposed method works quite well on probe data from different domains, such as indoor scenes and COCO. This implies that the proposed method is robust to the choice of probe data to some degree, which makes data collection effortless. Furthermore, it can be seen that probe data from indoor scenes and COCO surprisingly predict the taskonomy transferability better than probe data drawn from taskonomy itself. We conjecture that more complex textures disentangle the attributions better, so the probe data from COCO and indoor scenes, which are generally more complex in texture, yield superior results to the taskonomy probe data. However, more research is necessary to determine whether this explanation holds in general.
• SVCCA also works well in estimating the transferability of taskonomy models. However, the proposed method yields superior or comparable performance to SVCCA when using gradient*input and ε-LRP for attribution.
What\u2019s more, as the proposed method measures transferability by\ncomputing distances, it is several times more ef\ufb01cient than SVCCA, especially when the hidden\nrepresentation is large in dimension or a new task is added into a large task dictionary.\n\nWith all these observations and the fact that the proposed method is signi\ufb01cantly more ef\ufb01cient than\ntaskonomy, the proposed method is indeed an effectual substitute for taskonomy, especially when\nhuman annotations are unavailable, when the model library is large in size, or when frequent model\ninsertion and update takes place.\n\n4.3 Experiments on Models beyond Taskonomy\n\nTo give a more comprehensive view of the proposed method, we also conduct experiments on the\nonline collected pre-trained models beyond taskonomy. Results are shown in Figure 5. The left two\nsub\ufb01gures show the correlation matrix from SVCCA and the model transferability matrix produced\nby our method. The right two sub\ufb01gures depict the task similarity trees produced by SVCCA and the\nproposed method. The classi\ufb01cation and inpainting models are listed in different colors. We have the\nfollowing observations.\n\u2022 The proposed method produces an af\ufb01nity matrix and a task similarity tree alike those derived\nfrom SVCCA, although the collected models are heterogeneous in architectures, tasks, and input\nsize. These results further validate that models producing similar attribution maps also produce\nhighly correlated representations.\n\u2022 All the ImageNet-trained classi\ufb01cation models, despite their different architectures, tend to cluster\ntogether. Furthermore, the same-task trained models with the similar architectures tend to be more\nrelated than with dissimilar architectures. 
For example, ResNet50 is more related to ResNet101 and ResNet152 than to the VGG, MobileNet and Inception models, indicating that the architecture plays a certain role in regularizing the solutions to the tasks.

[Figure 4 graphics: P@K and R@K curves for saliency, gradient*input and ε-LRP with taskonomy, COCO and indoor-scene probe data, alongside SVCCA, random-ranking and oracle references, together with the task similarity tree constructed by ε-LRP.]

Figure 5: Results on collected models beyond taskonomy. From left to right: affinity matrix from SVCCA, affinity matrix from attribution maps, task similarity tree from SVCCA, and task similarity tree from attribution maps.

• The inpainting models, albeit trained on data from different domains, also tend to cluster together. This implies that different models of the same task, even when trained on different data domains, tend to play similar roles in transfer learning. However, more research is necessary to verify whether this observation holds in general.

We also merge the two groups into one to further evaluate the proposed method; the results, provided in the supplementary material, offer further insights into model transferability.

5 Conclusion

We introduce in this paper an embarrassingly simple yet efficacious approach to estimating the transferability between deep models, without using any human annotation.
Specifically, we project the pre-trained models of interest into a model space, wherein each model is treated as a point and the distance between two points is used to approximate their transferability. The projection into the model space is achieved by computing attribution maps on an unlabelled probe dataset. The proposed approach imposes no constraints on the architectures of the models, and turns out to be robust to the selection of the probe data. Despite its lightweight construction, it yields a transferability map highly similar to the one obtained by taskonomy, yet runs several orders of magnitude faster, and may therefore serve as a compact and fast transferability estimator, especially when no annotations are available, the model library is large, or frequent model insertions and updates take place.

Acknowledgments

This work is supported by National Key Research and Development Program (2016YFB1200203), National Natural Science Foundation of China (61572428), Key Research and Development Program of Zhejiang Province (2018C01004), and the Major Scientific Research Project of Zhejiang Lab (No. 2019KD0AC01).