{"title": "Transductive Zero-Shot Learning with Visual Structure Constraint", "book": "Advances in Neural Information Processing Systems", "page_first": 9972, "page_last": 9982, "abstract": "To recognize objects of the unseen classes, most existing Zero-Shot Learning (ZSL) methods first learn a compatible projection function between the common semantic space and the visual space based on the data of source seen classes, then directly apply it to the target unseen classes. However, in real scenarios, the data distribution between the source and target domain might not match well, thus causing the well-known domain shift problem. Based on the observation that visual features of test instances can be separated into different clusters, we propose a new visual structure constraint on class centers for transductive ZSL, to improve the generality of the projection function (\\ie alleviate the above domain shift problem). Specifically, three different strategies (symmetric Chamfer-distance,Bipartite matching distance, and Wasserstein distance) are adopted to align the projected unseen semantic centers and visual cluster centers of test instances. We also propose a new training strategy to handle the real cases where many unrelated images exist in the test dataset, which is not considered in previous methods. 
Experiments on many widely used datasets demonstrate that the proposed visual structure constraint can bring substantial performance gain consistently and achieve state-of-the-art results.", "full_text": "Transductive Zero-Shot Learning with Visual\n\nStructure Constraint\n\nZiyu Wan\u22171, Dongdong Chen\u22172, Yan Li3, Xingguang Yan4\n\nJunge Zhang5, Yizhou Yu6, Jing Liao\u20201\n\n1 City University of Hong Kong 2 Microsoft Cloud+AI\n\n3 PCG, Tencent 4 Shenzhen University 5 NLPR, CASIA 6 Deepwise AI Lab\n\nAbstract\n\nTo recognize objects of the unseen classes, most existing Zero-Shot Learning(ZSL)\nmethods \ufb01rst learn a compatible projection function between the common semantic\nspace and the visual space based on the data of source seen classes, then directly\napply it to the target unseen classes. However, in real scenarios, the data distribution\nbetween the source and target domain might not match well, thus causing the well-\nknown domain shift problem. Based on the observation that visual features of test\ninstances can be separated into different clusters, we propose a new visual structure\nconstraint on class centers for transductive ZSL, to improve the generality of the\nprojection function (i.e.alleviate the above domain shift problem). Speci\ufb01cally,\nthree different strategies (symmetric Chamfer-distance, Bipartite matching distance,\nand Wasserstein distance) are adopted to align the projected unseen semantic centers\nand visual cluster centers of test instances. We also propose a new training strategy\nto handle the real cases where many unrelated images exist in the test dataset, which\nis not considered in previous methods. Experiments on many widely used datasets\ndemonstrate that the proposed visual structure constraint can bring substantial\nperformance gain consistently and achieve state-of-the-art results. 
The source code is available at https://github.com/raywzy/VSC.\n\n1 Introduction\n\nRelying on massive labeled training datasets, significant progress has been made for image recognition in recent years [12]. However, it is unrealistic to label all object classes, which makes these supervised learning methods struggle to recognize objects that are unseen during training. By contrast, Zero-Shot Learning (ZSL) [24, 38, 40] only requires labeled images of seen classes (source domain), and is capable of recognizing images of unseen classes (target domain). The seen and unseen domains often share a common semantic space, which defines how unseen classes are semantically related to seen classes. The most popular semantic space is based on semantic attributes, where each seen or unseen class is represented by an attribute vector. Besides the semantic space, images of the source and target domains are also related and represented in a visual feature space.\nTo associate the semantic space and the visual space, existing methods often rely on the source domain data to learn a compatible projection function to map one space to the other, or two compatible projection functions to map both spaces into one common embedding space. During test time, to recognize an image in the target domain, the semantic vectors of all unseen classes and the visual feature of this image are projected into the embedding space using the learned function, and nearest neighbor (NN) search is then performed to find the best-matching class. However, due to the existence of\n\n\u2217Equal contribution. Email: ziyuwan2-c@my.cityu.edu.hk, cddlyf@gmail.com\n\u2020The corresponding author. 
Email: jingliao@cityu.edu.hk\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fthe distribution difference between the source and target domains in most real scenarios, the learned\nprojection function often suffers from the well-known domain shift problem.\nTo compensate for this domain gap, transductive zero-shot learning [9] assumes that the semantic\ninformation (e.g.attributes) of unseen classes and visual features of all test images are known in\nadvance. Different ways like domain adaption [15] and label propagation [39] are well investigated\nto better leverage this extra information. Recently, Zhang et al.[38] \ufb01nd that visual features of unseen\ntarget instances can be separated into different clusters even though their labels are unknown as shown\nin Figure 1. By incorporating this prior as a regularization term, a better label assignment matrix can\nbe solved with a non-convex optimization procedure. However, their method still has three main\nlimitations: 1) This visual structure prior is not used to learn a better projection, which directly limits\nthe upper bound of the \ufb01nal performance. 2) They model the ZSL problem as a less-scalable batch\nmode, which requires reoptimization when adding new test data. 3) Like most previous transductive\nZSL methods, they have not considered the real cases where many unrelated images may exist in the\ntest dataset and make the above prior invalid.\nConsidering the \ufb01rst problem, we model the above visual structure prior as a new constraint to learn\na better projection function rather than use the pre-de\ufb01ned one. In this paper, we adopt the visual\nspace as the embedding space and project the semantic space into it. To learn the projection function,\nwe not only use the projection constraint of the source domain data as [35] but also impose the\naforementioned visual structure constraint of the target domain data. 
Specifically, during training, we first project all the unseen semantic classes into the visual space, then consider three different strategies (\u201cChamfer-distance based\u201d, \u201cBipartite matching based\u201d and \u201cWasserstein-distance based\u201d) to align the projected unseen semantic centers and the visual centers. However, due to the lack of labels of test instances in the ZSL setting, we approximate these real visual centers with an unsupervised clustering algorithm (e.g., K-means). Note that in our method, we directly apply the learned projection function to online-mode testing, which is friendlier to real applications compared to the batch mode in [38].\nFor the third problem of real application scenarios, since many unrelated images, which belong to neither seen nor unseen classes, often exist in the target domain, using current unsupervised clustering algorithms directly on the whole test dataset will generate invalid visual centers, thus misguiding the learning of the projection function. To overcome this problem, we further propose a new training strategy which first filters out the highly unrelated images and then uses the remaining ones to impose the proposed visual constraint. To the best of our knowledge, we are the first to consider this transductive ZSL configuration with unrelated test images.\nWe demonstrate the effectiveness of the proposed visual structure constraint on many widely-used datasets. 
Experiments show that the proposed visual structure constraint consistently brings substantial performance gain and achieves state-of-the-art results.\nTo summarize, our contributions are three-fold:\n\n\u2022 We propose three different types of visual structure constraint for the projection learning of transductive ZSL to alleviate its domain shift problem.\n\u2022 We introduce a new transductive ZSL configuration where many unrelated images exist in the test dataset and propose a new training strategy to make our method work for it.\n\u2022 Experiments demonstrate that the proposed visual structure constraint can bring substantial performance gain consistently and achieve state-of-the-art results.\n\n2 Related Work\nUnlike supervised image recognition [12], which relies on large-scale human annotations and cannot generalize to unseen classes, ZSL bridges the gap between training seen classes and test unseen classes via different kinds of semantic spaces. Among them, the most popular and effective one is the attribute-based semantic space [17], which is often designed by experts. To incorporate more attributes and save human labor, text description-based [27] and word vector-based semantic spaces [7] have also been proposed. Though the effectiveness of the proposed structure constraint is only demonstrated with the attribute semantic space by default, it should generalize to all these spaces.\nTo relate the visual features of test images and the semantic attributes of unseen classes, three different embedding spaces are used by existing zero-shot learning methods: the original semantic space, the original visual space, and a learned common embedding space.
Correspondingly, a projection function is learned from the visual space to the semantic space [27] or from the semantic space to the visual space [35], or two projection functions are learned from the semantic and visual spaces to a common embedding space [4, 22]. Our method uses the visual space as the embedding space, because it can help to alleviate the hubness problem [26], as shown in [35]. More importantly, our structure constraint is based on the separability of the visual features of unseen classes.\nRecently, to alleviate the domain shift problem, transductive approaches [9, 19] have been proposed to leverage test-time unseen data in the learning stage. For example, unsupervised domain adaptation is used in [15], and transductive multi-class and multi-label ZSL are proposed in [9]. Our method also belongs to the transductive approaches, and the proposed visual structure constraint is inspired by [38], but we have addressed their aforementioned drawbacks and improved the performance significantly.\n3 Method\nProblem Definition In the ZSL setting, we have $N_s$ source labeled samples $D_s \equiv \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s$ is an image and $y_i^s \in Y_s = \{1, \dots, S\}$ is the corresponding label within total $S$ source classes. We are also given $N_u$ unlabeled target samples $D_u \equiv \{x_i^u\}_{i=1}^{N_u}$ that are from target classes $Y_u = \{S+1, \dots, S+U\}$. According to the definition of ZSL, there is no overlap between source seen classes $Y_s$ and target unseen classes $Y_u$, i.e., $Y_s \cap Y_u = \emptyset$. But they are associated in a common semantic space, which is the knowledge bridge between the source and target domains. As explained before, we adopt the semantic attribute space here, where each class $z \in Y_s \cup Y_u$ is represented with a pre-defined auxiliary attribute vector $a_z \in A$. 
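The problem setup above can be sketched with toy data structures (all sizes and values below are illustrative, not from the paper's datasets):

```python
import numpy as np

# Toy sketch of the ZSL data layout: disjoint seen/unseen label sets that
# share one attribute space. All names and sizes here are made up.
S, U, d_attr = 3, 2, 5                      # seen classes, unseen classes, attribute dim
Ys = set(range(1, S + 1))                   # source (seen) labels {1..S}
Yu = set(range(S + 1, S + U + 1))           # target (unseen) labels {S+1..S+U}
assert Ys.isdisjoint(Yu)                    # Ys ∩ Yu = ∅ by definition

rng = np.random.default_rng(0)
# Every class, seen or unseen, has a pre-defined attribute vector a_z ∈ A:
A = {z: rng.random(d_attr) for z in Ys | Yu}

# Source domain: labeled (feature, label) pairs; target domain: features only.
Ds = [(rng.random(8), y) for y in Ys for _ in range(4)]
Du = [rng.random(8) for _ in range(6)]
print(len(Ds), len(Du), len(A))             # → 12 6 5
```

The attribute dictionary covering both label sets is what makes unseen classes reachable at all: it is the only bridge between the two domains.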
The goal of ZSL is to predict the label $y_i^u \in Y_u$ given $x_i^u$, with no labeled training data.\nBesides the semantic representations, images of the source and target domains are also represented with their corresponding features in a common visual space. To relate these two spaces, projection functions are often learned to project them into a common embedding space. Following [35], we directly use the visual space as the embedding space, in which case only one projection function is needed. The key problem then becomes how to learn a better projection function.\nMotivation Our method is inspired by [38], whose idea is shown in Figure 1: thanks to the powerful discriminativity of pre-trained CNNs, the visual features of test images can be separated into different clusters. We denote the centers of these clusters as real centers. We believe that if we had a perfect projection function to project the semantic attributes to the visual space, the projected points (called synthetic centers) would align with the real centers. However, due to the domain shift problem, the projection function learned on the source domain is not perfect, so the synthetic centers (i.e., the VCL centers in Figure 1) deviate from the real centers, and NN search among these deviated centers to assign labels causes inferior ZSL performance. Based on the above analysis, besides source domain data, we attempt to take advantage of the existing discriminative structure of target unseen class clusters during the learning of the projection function, i.e., the learned projection function should also align the synthetic centers with the real ones in the target domain.\n\nFigure 1: Visualization of the CNN feature distribution of 10 target unseen classes on the AwA2 dataset using t-SNE, which can be clearly clustered into several real centers (stars). Squares (VCL) are synthetic centers projected by the projection function learned only from source domain data. By incorporating our visual structure constraint, our method (BMVSc) learns a better projection function, and the generated synthetic semantic centers are much closer to the real visual centers. Legend: VCL Center, Real Center, BMVSc Center, K-Means Center.\n\n3.1 Visual Center Learning (VCL)\n\nIn this section, we first introduce a baseline method which learns the projection function only with source domain data. Specifically, a CNN feature extractor $\phi(\cdot)$ is used to convert each image $x$ into a $d$-dimensional feature vector $\phi(x) \in R^{d \times 1}$. According to the above analysis, each class $i$ of the source domain should have a real visual center $c_i^s$, defined as the mean of all feature vectors in the corresponding class. For the projection function, a two-layer embedding network is utilized to transfer the source semantic attribute $a_i^s$ to the corresponding synthetic center $c_i^{syn,s}$:\n\n$c_i^{syn,s} = \sigma_2(w_2^T \sigma_1(w_1^T a_i^s))$ \quad (1)\n\nwhere $\sigma_1(\cdot)$ and $\sigma_2(\cdot)$ denote non-linear operations (Leaky ReLU with negative slope 0.2 by default), and $w_1$ and $w_2$ are the weights of the two fully connected layers to be learned.\nSince the correspondence relationship is given in the source domain, we directly adopt the simple mean square loss to minimize the distance between the synthetic centers $c^{syn}$ and the real centers $c$ in the visual feature space:\n\n$L_{MSE} = \frac{1}{S} \sum_{i=1}^{S} \|c_i^{syn,s} - c_i^s\|_2^2 + \lambda \Psi(w_1, w_2)$ \quad (2)\n\nwhere $\Psi(\cdot)$ is the L2-norm parameter regularizer decreasing the model complexity; we empirically set $\lambda = 0.0005$. 
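As a rough illustration, Eqs. (1) and (2) can be sketched in numpy as follows; the dimensions, weights, and data are made up for the sketch (in the paper, w1 and w2 are trained with Adam on real CNN feature centers):

```python
import numpy as np

# Sketch of the two-layer embedding network (Eq. 1) and center-wise MSE loss (Eq. 2).
def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def project(attrs, w1, w2):
    """Map class attributes a_i^s to synthetic visual centers c_i^{syn,s}."""
    return leaky_relu(leaky_relu(attrs @ w1) @ w2)

def vcl_loss(attrs, real_centers, w1, w2, lam=5e-4):
    syn = project(attrs, w1, w2)
    mse = np.mean(np.sum((syn - real_centers) ** 2, axis=1))  # (1/S) Σ ‖c_syn − c‖²
    reg = lam * (np.sum(w1 ** 2) + np.sum(w2 ** 2))           # λ Ψ(w1, w2)
    return mse + reg

rng = np.random.default_rng(0)
S, d_attr, d_vis = 4, 6, 8                    # S source classes (toy sizes)
w1 = rng.normal(0, 0.1, (d_attr, d_vis))      # learnable FC weights
w2 = rng.normal(0, 0.1, (d_vis, d_vis))
attrs = rng.random((S, d_attr))               # class attribute vectors a_i^s
centers = rng.random((S, d_vis))              # per-class means of CNN features
print(project(attrs, w1, w2).shape, vcl_loss(attrs, centers, w1, w2) > 0)  # → (4, 8) True
```

Training on one center per class rather than all instances is what keeps this step cheap: the loss touches only S points per epoch.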
Note that, different from [35], which trains with a large number of individual instances of each class $i$, we choose to utilize a single cluster center $c_i^s$ to represent each object class, and train the model with just several center points. This is based on the observation that instances of the same category form compact clusters, and it makes our method much more computationally efficient.\nWhen performing ZSL prediction, we first project the semantic attributes of each unseen class $i$ to its corresponding synthetic visual center $c_i^{syn,u}$ using the learned embedding network as in Equation (1). Then for a test image $x_k^u$, its classification result $i^*$ is obtained by selecting the nearest synthetic center in the visual space. Formally,\n\n$i^* = \mathrm{argmin}_i \|\phi(x_k^u) - c_i^{syn,u}\|_2$ \quad (3)\n\n3.2 Chamfer-Distance-based Visual Structure Constraint (CDVSc)\n\nAs discussed earlier, the domain shift problem causes the target synthetic centers $c^{syn,u}$ to deviate from the real centers $c^u$, thus yielding poor performance. Intuitively, if we also require the projected synthetic centers to align with the real ones by using the target domain dataset during the learning process, a better projection function can be learned. However, due to the lack of label information in the target domain, it is impossible to directly get the real centers of unseen classes. Considering the fact that the visual features of unseen classes can be separated into different clusters, we utilize an unsupervised clustering algorithm (K-means by default) to get approximated real centers. 
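The center-approximation step can be sketched with a toy K-means on synthetic two-cluster "features"; this minimal loop stands in for the off-the-shelf K-means used in practice, and the data are made up:

```python
import numpy as np

# Tiny K-means sketch: approximate the unseen "real centers" from unlabeled features.
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest center, then recompute the means
        assign = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.stack([X[assign == j].mean(0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
    return centers

# Two well-separated blobs standing in for CNN features of two unseen classes:
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
C = kmeans(X, 2)
print(np.sort(C[:, 0]))  # the two recovered centers sit near the blob means
```

When the clusters really are separable, as Figure 1 suggests, these recovered centers land close to the true class means, which is exactly what the structure constraint needs.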
To validate this, we plot the K-means centers in Figure 1; they are very close to the real ones.\nAfter obtaining the cluster centers, aligning the structure of the cluster centers to that of the synthetic centers can be formulated as reducing the distance between two unordered high-dimensional point sets. Inspired by work on 3D point clouds [6], a symmetric Chamfer-distance constraint is proposed to solve the structure matching problem:\n\n$L_{CD} = \sum_{x \in C^{syn,u}} \min_{y \in C^{clu,u}} \|x - y\|_2^2 + \sum_{y \in C^{clu,u}} \min_{x \in C^{syn,u}} \|x - y\|_2^2$ \quad (4)\n\nwhere $C^{clu,u}$ indicates the cluster centers of unseen classes obtained by the K-means algorithm, and $C^{syn,u}$ represents the synthetic target centers obtained with the learned projection. Combining the above constraint, the final loss function to train the embedding network is defined as:\n\n$L_{CDVSc} = L_{MSE} + \beta \times L_{CD}$ \quad (5)\n\n3.3 Bipartite-Matching-based Visual Structure Constraint (BMVSc)\n\nCDVSc helps to preserve the structure similarity of the two sets, but many-to-one matching may sometimes happen with the Chamfer-distance constraint. This conflicts with the important prior in ZSL that the obtained matching between synthetic and real centers should conform to a strict one-to-one principle. When undesirable many-to-one matching arises, the synthetic centers will be pulled to incorrect real centers and result in inferior performance. To address this issue, we change CDVSc to the bipartite-matching-based visual structure constraint (BMVSc), which aims to find a global minimum distance between the two sets while satisfying the strict one-to-one matching principle.\nWe first consider a graph $G = (V, E)$ with two partitions $A$ and $B$, where $A$ is the set of all synthetic centers $C^{syn,u}$ and $B$ contains all cluster centers of target classes. 
Let $dis_{ij} \in D$ denote the Euclidean distance between $i \in A$ and $j \in B$, and let element $x_{ij}$ of the assignment matrix $X$ define the matching relationship between $i$ and $j$. To find a one-to-one minimum matching between real and synthetic centers, we formulate it as a min-weight perfect matching problem and optimize:\n\n$L_{BM} = \min_X \sum_{i,j} dis_{ij} x_{ij}, \quad \mathrm{s.t.} \; \sum_j x_{ij} = 1, \; \sum_i x_{ij} = 1, \; x_{ij} \in \{0, 1\}$ \quad (6)\n\nIn this formulation, the assignment matrix $X$ strictly conforms to the one-to-one principle. To solve this linear programming problem, we employ the Kuhn-Munkres algorithm, whose time complexity is $O(V^2 E)$. Like CDVSc, we also combine the MSE loss and this bipartite matching loss:\n\n$L_{BMVSc} = L_{MSE} + \beta \times L_{BM}$ \quad (7)\n\n3.4 Wasserstein-Distance-based Visual Structure Constraint (WDVSc)\n\nIdeally, if the synthetic and real centers are compact and accurate, the above bipartite-matching-based distance can achieve a globally optimal matching. However, this assumption is not always valid, especially for the approximated cluster centers of target classes, because these centers may contain noise and are not accurate enough. Therefore, instead of using a hard-value (0 or 1) assignment matrix $X$, we further consider a soft-value $X$ whose values represent the joint probability distribution between the two point sets, by using the Wasserstein distance. In optimal transport theory, the Wasserstein distance is a good metric for measuring the distance between two discrete distributions; the goal is to find the optimal \u201ccoupling matrix\u201d $X$ that achieves the minimum matching distance. Its objective formulation is the same as Equation (6), but $X$ represents soft joint probability values rather than $\{0, 1\}$. 
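The two set-matching distances of Eqs. (4) and (6) can be sketched as follows; a brute-force search over permutations stands in for the Kuhn-Munkres solver here and is only feasible for toy set sizes:

```python
import itertools
import numpy as np

# Sketches of the Chamfer (Eq. 4) and min-weight perfect matching (Eq. 6)
# distances between synthetic centers (rows of A) and cluster centers (rows of B).
def chamfer(A, B):
    d2 = ((A[:, None] - B[None]) ** 2).sum(-1)    # pairwise squared distances
    return d2.min(1).sum() + d2.min(0).sum()      # both directions, so it is symmetric

def min_weight_perfect_matching(A, B):
    d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
    # One-to-one assignment minimizing total cost; O(n!) brute force, toy sizes only.
    return min(sum(d2[i, p] for i, p in enumerate(perm))
               for perm in itertools.permutations(range(len(B))))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.1, 0.0], [0.9, 0.0]])
print(round(chamfer(A, B), 3), round(min_weight_perfect_matching(A, B), 3))  # → 0.04 0.02
```

Note how the bipartite cost cannot collapse two synthetic centers onto one cluster center, which is precisely the failure mode the Chamfer distance permits.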
In this paper, in order to make this optimization problem convex and solve it more efficiently, we adopt the entropy-regularized optimal transport problem, solved with Sinkhorn iterations [5]:\n\n$L_{WD} = \min_X \sum_{i,j} dis_{ij} x_{ij} - \epsilon H(X)$ \quad (8)\n\nwhere $H(X) = -\sum_{ij} x_{ij} \log x_{ij}$ is the entropy of the matrix $X$, and $\epsilon$ is the regularization coefficient encouraging a smoother assignment matrix $X$. The solution can be written in the form $X = \mathrm{diag}\{u\} K \mathrm{diag}\{v\}$ ($\mathrm{diag}\{v\}$ returns a square diagonal matrix with vector $v$ as the main diagonal), and the iterations alternate between updating $u$ and $v$:\n\n$u^{(k+1)} = \frac{a}{K v^{(k)}}, \quad v^{(k+1)} = \frac{b}{K^T u^{(k+1)}}$ \quad (9)\n\nHere, $K$ is a kernel matrix calculated from $D$, and $a$ and $b$ are the marginal distributions of the two point sets. Since these iterations solve a regularized version of the original problem, the resulting Wasserstein distance is sometimes called the Sinkhorn distance. Combining this constraint, the final loss function is:\n\n$L_{WDVSc} = L_{MSE} + \beta \times L_{WD}$ \quad (10)\n\n3.5 A Realistic Setting with Unrelated Test Data\n\nExisting transductive ZSL methods always assume that all images in the test dataset belong to the target unseen classes we have already defined. However, in real scenarios, many unrelated images which do not belong to any defined class may exist. If we directly perform clustering on all these unfiltered images, the approximated real centers will deviate far from the real centers of unseen classes and make the proposed visual structure constraint invalid. 
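A minimal sketch of the Sinkhorn iterations of Eq. (9), with an illustrative cost matrix and uniform marginals a and b (the paper's cost is the Euclidean distance matrix between the two center sets):

```python
import numpy as np

# Entropy-regularized optimal transport via Sinkhorn scaling (Eqs. 8-9 sketch).
def sinkhorn(D, a, b, eps=0.05, iters=200):
    K = np.exp(-D / eps)                 # kernel matrix computed from the cost D
    u = np.ones_like(a)
    for _ in range(iters):               # alternate the scaling updates of Eq. (9)
        v = b / (K.T @ u)
        u = a / (K @ v)
    X = np.diag(u) @ K @ np.diag(v)      # coupling X = diag{u} K diag{v}
    return X, (X * D).sum()              # soft assignment and its transport cost

D = np.array([[0.0, 1.0],
              [1.0, 0.0]])               # toy cost: the diagonal pairing is cheap
a = b = np.ones(2) / 2                   # uniform marginals over the two sets
X, cost = sinkhorn(D, a, b)
print(X.round(3), round(cost, 3))
```

With a small ε the coupling concentrates near the diagonal (the hard matching), while a larger ε spreads mass and tolerates noisy cluster centers; this naive kernel underflows for large D/ε ratios, where log-domain updates would be used instead.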
This problem also exists in [38]. To solve this relatively difficult setting, we propose a new training strategy for our method which first uses the baseline VCL to filter out unrelated images before conducting CDVSc, BMVSc or WDVSc.\nStep 1: Since the source domain data is definitely clean, and we assume that the domain shift problem is not that severe, we first use VCL to get the initial unseen synthetic centers $C$.\nStep 2: Find the distance set $D$ of the farthest point pair of each source class in the visual feature space.\nStep 3: Select a reliable image $x$ if and only if $\exists c_i \in C, \|x - c_i\|_2^2 \le \max(D)/2$, to construct a new target domain, and perform unsupervised clustering on this domain.\nStep 4: Conduct CDVSc, BMVSc or WDVSc as above.\n\nTable 1: Quantitative comparisons of MCA (%) under standard splits (SS) in conventional ZSL setting. I: Inductive, T: Transductive, O: Our method, Bold: Best, Blue: Second best, V: VGG, R: ResNet, G: GoogLeNet\n\nSUN72\n\n44.2\n\nSUN10\n\n82.5\n83.8\n\n\u2013\n\n\u2013\n\u2013\n\u2013\n\u2013\n\n\u2013\n\n\u2013\n\n86.0\n85.4\n\n87.8\n\n87.2\n90.6\n89.7\n91.2\n89.6\n90.5\n91.7\n92.2\n\nI\n\nT\n\nO\n\nf-CLSWGAN [33]\n\nMethod\n\nCONSE [24]\n\nSSE [36]\nJLSE [37]\nSynC [4]\nSAE [16]\nSCoRe [23]\n\nSP-ZSR [38]\nDSRL [34]\nDMaP [19]\nVZSL [31]\nQFSL [30]\n\nVCL\nCDVSc\nBMVSc\nWDVSc\nVCL\nCDVSc\nBMVSc\nWDVSc\n\nFeatures\n\nV+G+R\n\nR\nV\nV\nR\nR\nV\nR\nV\nV\n\nV\nV\nV\nV\nV\nV\nR\nR\nR\nR\n\nAwA1\n63.6\n76.3\n80.5\n72.2\n80.6\n82.8\n69.9\n92.0\n87.2\n90.5\n94.8\n\n\u2013\n\n81.7\n89.6\n92.7\n92.9\n82.0\n94.3\n95.9\n96.2\n\nAwA2\n67.9\n\n71.2\n80.7\n\n\u2013\n\u2013\n\n\u2013\n\u2013\n\u2013\n\u2013\n\u2013\n\u2013\n\n84.1\n82.6\n93.3\n94.0\n94.2\n82.5\n93.9\n96.8\n96.7\n\nCUB\n36.7\n30.4\n42.1\n54.1\n33.4\n59.5\n61.5\n53.2\n57.1\n67.7\n66.5\n61.2\n58.2\n69.9\n70.8\n71.0\n60.1\n74.2\n73.6\n74.2\n\n59.1\n42.4\n\n62.1\n\n\u2013\n\u2013\n\n\u2013\n\n\u2013\n\u2013\n\u2013\n\u2013\n\u2013\n\n58.8\n59.7\n61.3\n62.3\n63.8\n64.5\n66.2\n67.8\n\n4 Experiments\n\nImplementation Details We adopt the pretrained ResNet-101 to extract visual features unless specified otherwise. All images are resized to 224 \u00d7 224 without any data augmentation, and the dimension of the extracted features is 2048. The hidden unit numbers of the two FC layers in the embedding network are both 2048. Both visual features and semantic attributes are L2-normalized. Using the Adam optimizer, our method is trained for 5000 epochs with a fixed learning rate of 0.0001. The weight $\beta$ in CDVSc and BMVSc is cross-validated in $[10^{-4}, 10^{-3}]$ and $[10^{-5}, 10^{-4}]$ respectively, while WDVSc directly sets $\beta = 0.001$ because of its very stable performance.\n\nDatasets To demonstrate the effectiveness of our method, extensive experiments are conducted on several widely-used ZSL benchmark datasets, i.e., AwA1, AwA2, CUB, SUN10, and SUN72. Following the same configuration as previous methods, two different data split strategies are adopted: 1) Standard Splits (SS): the standard seen/unseen class split first proposed in [17] and widely used in most ZSL works. 2) Proposed Splits (PS): this split is proposed by [32] to remove the overlapping ImageNet-1K classes from the target domain, since ImageNet-1K is used to pre-train the CNN model. Please refer to the supplementary material for more details.\n\nEvaluation Metrics For fair comparison and completeness, we consider two different ZSL settings: 1) Conventional ZSL, which assumes all test instances belong only to target unseen classes. 2) Generalized ZSL, where test instances come from both seen and unseen classes, which is a more realistic setting for real applications. For the former setting, we compute the multi-way classification accuracy (MCA) as in previous works, while for the latter we define three metrics. 
1) accYs \u2013\nthe accuracy of classifying the data samples from the seen classes to all the classes (both seen and\nunseen); 2) accYu \u2013 the accuracy of classifying the data samples from the unseen classes to all the\nclasses; 3) H \u2013 the harmonic mean of accYs and accYu.\n\n6\n\n\fTable 2: Quantitative comparisons under the pro-\nposed splits (PS).\n\nTable 3: Quantitative comparisons under general-\nized ZSL setting.\n\n38.8\n56.5\n53.7\n56.3\n40.3\n51.7\n\nMethod AwA2 CUB SUN72 Ave.\n39.2\n56.0\n56.5\n52.8\n42.5\n60.7\n\nSJE[2]\nSynC [4]\nSAE [16]\nSCoRe [23]\nLDF[20]\n\nCONSE [24] 44.5 34.3\nDeViSE [7]\n59.7 52.0\n61.9 53.9\n46.6 55.6\n54.1 33.3\n69.5 61.0\n69.2\nPSR-ZSL[3] 63.8 56.0\nDCN [21]\n56.2\n61.5 59.6\n78.2 71.7\n81.7 71.0\n87.3\n73.4\n\nVCL\nCDVSc\nBMVSc\nWDVSc\n\n61.4\n61.8\n59.4\n61.2\n62.2\n63.4\n\n60.1\n70.3\n71.6\n74.7\n\n60.4\n\n\u2013\n\n\u2013\n\n\u2013\n\n\u2013\n\n\u2013\n\nAwA2\n\nCUB\n\nSJE[2]\n\nMethod\n\n90.6 1.0\n1.6\n82.5 14.8 8.5\n\nCONSE [24] 0.5\n8.1\n\naccYu accYs H accYu accYs H\n72.2 3.1\nSSE [36]\n46.9 14.4\nDeViSE [7] 17.1 74.7 27.8 23.8 53.0 32.8\n73.9 14.4 23.5 59.2 33.6\n8.0\n5.9\n77.8 11.0 12.6 63.8 21.0\n10.0 90.5 18.0 11.5 70.9 19.8\n14.0 81.8 23.9 23.7 62.8 34.4\nPSR-ZSL[3] 20.7 73.8 32.3 24.6 54.3 33.9\n21.4 89.6 34.6 15.6 86.3 26.5\n66.9 88.1 76.0 37.0 84.6 51.4\n71.9 88.2 79.2 33.1 86.1 47.9\n76.4 88.1 81.8 43.3 85.4 57.5\n\nESZSL[28]\nSynC [4]\nALE[1]\n\nVCL\nCDVSc\nBMVSc\nWDVSc\n\n4.1 Conventional ZSL Results\n\nTo show the effectiveness of the proposed visual structure constraint, we \ufb01rst compare our method\nwith existing state-of-the-art methods in the conventional setting. Table 1 is the comparison results\nunder standard splits (SS), where we also re-implement our method using 4096-dimensional VGG\nfeatures to guarantee fairness. 
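The three generalized-ZSL metrics defined in the Evaluation Metrics paragraph can be sketched as follows (plain accuracy on toy labels stands in for the per-class averaging typically used in practice):

```python
# Sketch of the generalized-ZSL metrics: accYs, accYu, and their harmonic mean H.
def accuracy(true, pred):
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def harmonic_mean(acc_s, acc_u):
    return 0.0 if acc_s + acc_u == 0 else 2 * acc_s * acc_u / (acc_s + acc_u)

# Toy predictions over seen (labels 1-2) and unseen (labels 3-4) test samples;
# note the unseen samples may be misassigned to seen labels, and vice versa.
seen_true, seen_pred = [1, 1, 2, 2], [1, 1, 2, 1]      # accYs = 0.75
unseen_true, unseen_pred = [3, 4, 3, 4], [3, 4, 1, 1]  # accYu = 0.5
acc_s = accuracy(seen_true, seen_pred)
acc_u = accuracy(unseen_true, unseen_pred)
print(acc_s, acc_u, harmonic_mean(acc_s, acc_u))  # → 0.75 0.5 0.6
```

The harmonic mean punishes the common failure mode in Table 3, where a method scores high accYs but near-zero accYu: H collapses toward the smaller of the two accuracies.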
Obviously, with the three different types of visual structure constraint, our method obtains substantial performance gains consistently on all the datasets and outperforms previous state-of-the-art methods. The only exception is that VZSL [31] is slightly better than our method on the AwA1 dataset when using VGG features.\nSpecifically, comparing with SP-ZSR [38], which shares a similar spirit with our method, we find that its performance is sometimes even worse than that of inductive methods such as SynC [4], SCoRe [23] or VCL. The possible underlying reason is that, when the structure information is utilized only at test time, the final performance gain highly depends on the quality of the projection function. When the projection function is not good enough, the initial synthetic centers deviate far from the real centers and produce bad matchings with the unsupervised cluster centers, thus causing even worse results. By contrast, in our method, this visual structure constraint is incorporated into the learning of the projection function in the training stage, which helps to learn a better projection function and brings performance gain consistently. Another bonus is that, at runtime, we can directly do recognition in real-time online mode rather than the batch-mode optimization of SP-ZSR [38], which is friendlier for real applications.\nThe results on the proposed splits of AwA2, CUB and SUN72 are reported in Table 2 with ResNet-101 features. It can be seen that almost all methods suffer from performance degradation under this setting. However, our proposed method still maintains the highest accuracy. Specifically, the improvements obtained by our method range from 0.8% to 25.8%, which indicates that the visual structure constraint is effective in alleviating the domain shift problem.\n4.2 Generalized ZSL Results\n\nIn Table 3, we compare our method with eight different generalized ZSL methods. 
It can be seen that,\nalthough almost all the methods cannot maintain the same level accuracy for both seen (accYs) and\nunseen classes (accYu), our method with visual structure constraint still signi\ufb01cantly outperforms\nother methods by a large margin on these datasets. More speci\ufb01cally, take CONSE [24] as an example,\ndue to the domain shift problem, it can achieve the best results on the source seen classes but totally\nfails on the target unseen classes. By contrast, since the proposed two structure constraints can help\nto align the structure of synthetic centers to that of real unseen centers, our method can achieve\nacceptable ZSL performance on target unseen classes.\n\n4.3 Results of New Realistic Setting\n\nTo imitate the realistic setting where many unrelated images may exist in the test dataset, we mix the\ntest dataset with extra 8K unrelated images from the aPY dataset. These unrelated images do not\nbelong to the classes of either AwA2 or ImageNet-1K. From Table 4, it could be seen that without\n\n7\n\n\fTable 4: Results (%) on more realistic setting. With the new proposed training strategy (S + \u2217), the proposed\nmethod can still work well and bring performance gain.\n\nMethod\nSS+noise\nPS+noise\n\nVCL CDVSc BMVSc WDVSc S+CDVSc S+BMVSc S+WDVSc\n82.5\n61.5\n\n92.4\n78.3\n\n78.3\n58.9\n\n81.3\n60.8\n\n89.3\n65.3\n\n79.7\n57.4\n\n86.9\n66.7\n\nTable 5: Generality to the word vector based semantic space on the AwA1 dataset.\n\nMethod DeViSE[7] ZSCNN[18] SS-Voc[10] DEM[35] VCL CDVSc BMVSc WDVSc\nMCA (%)\n\n90.8\n\n50.4\n\n58.7\n\n68.9\n\n78.8\n\n72.3\n\n79.4\n\n83.9\n\n\ufb01ltering out the unrelated images, the performance of our method with CDVSc, BMVSc and WDVSc\ndegrades, which means that the alignment of wrong visual structures is harmful to the learning of\nprojection function. 
By contrast, with the newly proposed training strategy (S+*), the proposed visual structure constraint still works very well.

4.4 More Analysis

Due to limited space, only two analysis experiments are given in this section. Please refer to the supplementary material for more analysis.

Generality to the word-vector-based semantic space. Compared to some previous methods that are only applicable to one specific semantic space, we further demonstrate that the proposed visual structure constraint can also be applied to a word-vector-based semantic space. Specifically, to obtain the word representations used as inputs to the embedding networks, we use the GloVe text model [25] trained on the Wikipedia corpus, which yields 300-d vectors. For classes whose names contain multiple words, we look up all the words in the trained model and average their word embeddings to form the corresponding category embedding. As shown in Table 5, although word vectors carry less effective information than semantic attributes, the proposed visual structure constraint still brings substantial performance gains and outperforms previous methods. Note that DEM [35] utilizes the 1000-d word vectors provided by [8, 9] to represent a category.

Robustness to imperfect separability of the visual features of unseen classes. Although our motivation is inspired by the strong separability of the visual features of unseen classes on the AwA2 dataset, we find that the proposed visual structure constraint is very robust and does not rely on it heavily. For example, on the CUB dataset the feature distribution (see Fig. 5 in the supplementary material) is not fully separable, but the proposed visual structure constraint still brings significant performance gains, as shown in the tables above.
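As a concrete illustration of the category-embedding construction used in the word-vector experiment above, the following is a minimal sketch, not the authors' implementation: the embedding table is a 3-d toy stand-in for the 300-d GloVe model, and the class names and vectors are purely illustrative.

```python
import numpy as np

# Toy embedding table standing in for the 300-d GloVe model; the words
# and 3-d vectors below are illustrative, not real GloVe entries.
glove = {
    "humpback": np.array([0.2, 0.1, 0.7]),
    "whale": np.array([0.3, 0.5, 0.1]),
    "rat": np.array([0.9, 0.0, 0.4]),
}

def class_embedding(class_name, embeddings):
    """Average the embeddings of all in-vocabulary words of a class name,
    as done for multi-word categories (e.g. AwA names like 'humpback+whale')."""
    words = class_name.replace("+", " ").split()
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        raise KeyError("no embedding found for class '%s'" % class_name)
    return np.mean(vecs, axis=0)

emb = class_embedding("humpback+whale", glove)
print(emb)  # element-wise mean of the 'humpback' and 'whale' vectors
```

Single-word classes simply return their own word vector; out-of-vocabulary words are skipped before averaging.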
This is because, even if some of the clusters are incorrect, the proposed visual structure constraint remains beneficial as long as most of them are correct.
On the other hand, although the number of unseen classes K is often pre-defined, we find that the proposed visual constraint can improve performance even when K is unknown. In Table 6, we report the performance for different values of K. Specifically, we first perform K-means in the semantic space and the visual space simultaneously, and then use BMVSc and WDVSc to align the two resulting sets of centers. The proposed visual structure constraint brings consistent performance gains; as K increases, it captures more fine-grained structure of the visual space and achieves better results. In other words, as long as the visual features form some superclasses (not necessarily at the fine level, which is satisfied by most datasets), the proposed visual structure constraint is always effective.

5 Conclusion

To alleviate the domain shift problem in ZSL, three different types of visual structure constraint are proposed for transductive ZSL in this paper. We also introduce a new transductive ZSL configuration for real applications and design a new training strategy to make our method work well under it. Experiments demonstrate that they bring substantial performance gains consistently on different benchmark datasets and outperform previous state-of-the-art methods by a large margin.
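The center-alignment step underlying the analysis above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses SciPy's Hungarian-algorithm solver for the one-to-one (BMVSc-style) matching, and the 2-d center coordinates are purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_align(projected_centers, cluster_centers):
    """Find the one-to-one matching between projected semantic centers and
    visual cluster centers that minimizes the total Euclidean distance,
    solved with the Hungarian algorithm."""
    # Pairwise Euclidean distance matrix (n_projected x n_clusters).
    cost = np.linalg.norm(
        projected_centers[:, None, :] - cluster_centers[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].sum()

# Illustrative 2-d centers: the visual cluster centers are a permuted,
# slightly perturbed copy of the projected semantic centers.
proj = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
vis = np.array([[5.1, 4.9], [10.2, 0.1], [0.1, -0.1]])
match, total_cost = bipartite_align(proj, vis)
print(match)  # [2 0 1]: projected center i pairs with visual cluster match[i]
```

In the transductive setting described above, the visual cluster centers would come from K-means on the unlabeled test features, and the matching cost would serve as the alignment loss during training.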
In the future, we will try to apply the proposed idea to broader application scenarios [13, 11, 14, 29].

Table 6: Results (%) for different cluster numbers K on the AwA2 dataset.

K     | 0    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10
BMVSc | 61.5 | 62.3 | 62.9 | 64.5 | 65.1 | 68.2 | 70.1 | 74.3 | 81.7
WDVSc | 61.5 | 63.4 | 64.0 | 66.3 | 67.0 | 69.2 | 75.1 | 80.3 | 87.3

6 Acknowledgements

We would like to thank the anonymous reviewers for their thoughtful comments and efforts towards improving our work. This work was supported by the Natural Science Foundation of China (NSFC) No. 61876181, CityU start-up grant 7200607, and Hong Kong ECS grant 21209119.

References

[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. TPAMI, 2016.
[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
[3] Y. Annadani and S. Biswas. Preserving semantic relations for zero-shot learning. In CVPR, 2018.
[4] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.
[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
[6] H. Fan, H. Su, and L. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.
[7] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[8] Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV, 2014.
[9] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. TPAMI, 2015.
[10] Y. Fu and L. Sigal. Semi-supervised vocabulary-informed learning.
In CVPR, 2016.
[11] P. Fuchs, C. Loeseken, J. K. Schubert, and W. Miekisch. Breath gas aldehydes as biomarkers of lung cancer. IJC, 126(11):2663–2670, 2010.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] S. Hong, M. Wu, Y. Zhou, Q. Wang, J. Shang, H. Li, and J. Xie. ENCASE: An ensemble classifier for ECG classification using expert features and deep neural networks. In CinC, pages 1–4. IEEE, 2017.
[14] S. Hong, Y. Zhou, M. Wu, J. Shang, Q. Wang, H. Li, and J. Xie. Combining deep neural networks and engineered features for cardiac arrhythmia detection from ECG recordings. PM, 40(5):054009, 2019.
[15] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Unsupervised domain adaptation for zero-shot learning. In ICCV, 2015.
[16] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.
[17] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[18] J. Lei Ba, K. Swersky, S. Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
[19] Y. Li, D. Wang, H. Hu, Y. Lin, and Y. Zhuang. Zero-shot recognition using dual visual-semantic mapping paths. In CVPR, 2017.
[20] Y. Li, J. Zhang, J. Zhang, and K. Huang. Discriminative learning of latent features for zero-shot recognition. In CVPR, 2018.
[21] S. Liu, M. Long, J. Wang, and M. I. Jordan. Generalized zero-shot learning with deep calibration network. In NeurIPS, pages 2005–2015, 2018.
[22] Y. Lu. Unsupervised learning on neural network outputs: with application in zero-shot learning. In IJCAI, 2016.
[23] P. Morgado and N. Vasconcelos.
Semantically consistent regularization for zero-shot recognition. In CVPR, 2017.
[24] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[25] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[26] M. Radovanović, A. Nanopoulos, and M. Ivanović. Hubs in space: Popular nearest neighbors in high-dimensional data. JMLR, 2010.
[27] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
[28] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
[29] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. In NIPS, pages 656–664, 2012.
[30] J. Song, C. Shen, Y. Yang, Y. Liu, and M. Song. Transductive unbiased embedding for zero-shot learning. In CVPR, 2018.
[31] W. Wang, Y. Pu, V. K. Verma, K. Fan, Y. Zhang, C. Chen, P. Rai, and L. Carin. Zero-shot learning via class-conditioned deep generative models. In AAAI, 2018.
[32] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning: A comprehensive evaluation of the good, the bad and the ugly. TPAMI, 2018.
[33] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.
[34] M. Ye and Y. Guo. Zero-shot classification with discriminative semantic representation learning. In CVPR, 2017.
[35] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
[36] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.
[37] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding.
In CVPR, 2016.
[38] Z. Zhang and V. Saligrama. Zero-shot recognition via structured prediction. In ECCV, 2016.
[39] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University, 2002.
[40] Y. Zhu, J. Xie, Z. Tang, X. Peng, and A. Elgammal. Learning where to look: Semantic-guided multi-attention localization for zero-shot learning. arXiv preprint arXiv:1903.00502, 2019.