{"title": "Dual Adversarial Semantics-Consistent Network for Generalized Zero-Shot Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6146, "page_last": 6157, "abstract": "Generalized zero-shot learning (GZSL) is a challenging class of vision and knowledge transfer problems in which both seen and unseen classes appear during testing. Existing GZSL approaches either suffer from semantic loss and discard discriminative information at the embedding stage, or cannot guarantee the visual-semantic interactions. To address these limitations, we propose a Dual Adversarial Semantics-Consistent Network (referred to as DASCN), which learns both primal and dual Generative Adversarial Networks (GANs) in a unified framework for GZSL. In DASCN, the primal GAN learns to synthesize inter-class discriminative and semantics-preserving visual features from both the semantic representations of seen/unseen classes and the ones reconstructed by the dual GAN. The dual GAN enforces the synthetic visual features to represent prior semantic knowledge well via semantics-consistent adversarial learning. To the best of our knowledge, this is the first work that employs a novel dual-GAN mechanism for GZSL. 
Extensive experiments show that our approach achieves significant improvements over the state-of-the-art approaches.", "full_text": "Dual Adversarial Semantics-Consistent Network for\n\nGeneralized Zero-Shot Learning\n\nJian Ni1\n\nnj1@mail.ustc.edu.cn\n\nShanghang Zhang2\n\nshanghaz@andrew.cmu.edu\n\nHaiyong Xie3,4,1\n\nhaiyong.xie@ieee.org\n\n1University of Science and Technology of China, Anhui 230026, China\n\n2Carnegie Mellon University, Pittsburgh, PA 15213, USA\n\n3Advanced Innovation Center for Human Brain Protection, Capital Medical University, Beijing\n\n100054, China\n\n4National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data\n\n(NEL-PSRPC), Beijing 100041, China\n\nAbstract\n\nGeneralized zero-shot learning (GZSL) is a challenging class of vision and knowl-\nedge transfer problems in which both seen and unseen classes appear during\ntesting. Existing GZSL approaches either suffer from semantic loss and discard\ndiscriminative information at the embedding stage, or cannot guarantee the visual-\nsemantic interactions. To address these limitations, we propose a Dual Adversarial\nSemantics-Consistent Network (referred to as DASCN), which learns both primal\nand dual Generative Adversarial Networks (GANs) in a uni\ufb01ed framework for\nGZSL. In DASCN, the primal GAN learns to synthesize inter-class discriminative\nand semantics-preserving visual features from both the semantic representations of\nseen/unseen classes and the ones reconstructed by the dual GAN. The dual GAN\nenforces the synthetic visual features to represent prior semantic knowledge well\nvia semantics-consistent adversarial learning. To the best of our knowledge, this\nis the \ufb01rst work that employs a novel dual-GAN mechanism for GZSL. 
Extensive experiments show that our approach achieves significant improvements over the state-of-the-art approaches.

1 Introduction

In recent years, tremendous progress has been achieved across a wide range of computer vision and machine learning tasks with the introduction of deep learning. However, conventional deep learning approaches rely on large amounts of labeled data, and may thus suffer from performance decay on problems where only limited training data are available. The reasons are twofold. On the one hand, objects in the real world have a long-tailed distribution, and obtaining annotated data is expensive. On the other hand, novel categories of objects arise dynamically in nature, which fundamentally limits the scalability and applicability of supervised learning models when labeled examples are not available.
To tackle such restrictions, zero-shot learning (ZSL) has recently been widely researched and recognized as a feasible solution [16, 24]. ZSL is a learning paradigm that aims to correctly categorize objects from previously unseen classes without corresponding training samples. However, conventional ZSL models are usually evaluated in a restricted setting where test samples and the search space are limited to the unseen classes only, as shown in Figure 1. To address the shortcomings of ZSL, GZSL has been considered in the literature, since it not only learns information that can be transferred to an unseen class but also generalizes well to new data from seen classes.
ZSL approaches typically adopt two commonly used strategies. The first strategy is to convert tasks into visual-semantic embedding problems [4, 23, 26, 33]. 
They try to learn a mapping function from the visual space to the semantic space (note that all the classes reside in the semantic space), or to a latent intermediate space, so as to transfer knowledge from the seen classes to the unseen classes.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

However, the ability of these embedding-based ZSL models to transfer semantic knowledge is limited by the semantic loss and the heterogeneity gap [4]. Meanwhile, since the ZSL model is only trained with the labeled data from the seen classes, it is highly biased towards predicting the seen classes [3].
The second strategy ZSL approaches typically adopt is to use generative methods to generate various visual features conditioned on semantic feature vectors [7, 10, 19, 29, 35], which circumvents the need for labeled samples of unseen classes and boosts ZSL classification accuracy. Nevertheless, the performance of these methods is limited either by capturing the visual distribution information via only a unidirectional alignment from the class semantics to the visual features, or by adopting just a single Euclidean distance as the constraint to preserve the semantic information between the generated high-level visual features and the real semantic features. 
Recent work has shown that the performance of most ZSL approaches drops significantly in the GZSL setting [28].

Figure 1: Problem illustration of zero-shot learning (ZSL) and generalized zero-shot learning (GZSL).

To address these limitations, we propose a novel Dual Adversarial Semantics-Consistent Network (referred to as DASCN) for GZSL. DASCN is based on Generative Adversarial Networks (GANs), and is characterized by its dual structure, which enables bidirectional synthesis by allowing both common visual feature generation and the corresponding semantic feature reconstruction, as shown in Figure 2. Such bidirectional synthesis procedures in DASCN boost these two tasks jointly and collaboratively, preserving the visual-semantic consistency. This results in two advantages as follows. First, our generative model synthesizes inter-class discriminative visual features via a classification loss constraint, which ensures that synthetic visual features are discriminative enough among different classes. Second, our model encourages the synthesis of visual features that represent their semantic features well and are of a highly discriminative semantic nature from the perspectives of both form and content. From the form perspective, the semantic reconstruction error between the synthetic semantic features (reconstructed by the dual GAN from the pseudo visual features generated by the primal GAN) and the real corresponding semantic features is minimized to ensure that the reconstructed semantic features are tightly centered around the real corresponding class semantics. From the content perspective, the pseudo visual features (generated via the primal GAN by further exploiting the reconstructed semantic features as input) are constrained to be as close as possible to their respective real visual features in the data distribution. 
Therefore, our approach can ensure that the reconstructed semantic features are consistent with the relevant real semantic knowledge, thus avoiding semantic loss to a large extent.
We summarize our contributions as follows. First, we propose a novel generative dual adversarial architecture for GZSL, which effectively preserves semantics-consistency through a bidirectional alignment, alleviating the issue of semantic loss. To the best of our knowledge, DASCN is the first network to employ the dual-GAN mechanism for GZSL. Second, by combining the classification loss and the semantics-consistency adversarial loss, our model generates high-quality visual features with inter-class separability and a highly discriminative semantic nature, which is crucial to the generative approaches used in GZSL. Last but not least, we conduct comprehensive experiments demonstrating that DASCN is highly effective and consistently outperforms the state-of-the-art GZSL methods.
The remainder of this paper is structured as follows. We discuss the related work in Section 2, present our DASCN model in Section 3, evaluate the proposed model in Section 4, and conclude in Section 5.

[Figure 1 depicts seen training categories (Lanius, Geococcyx, Oriole) with attribute annotations such as yellow/white/black/beak; ZSL test images come from unseen categories only, whereas GZSL test images come from both seen and unseen categories.]

2 Related Work

2.1 Zero-Shot Learning

Some of the early ZSL works make use of primitive attribute prediction and classification, such as DAP [15] and IAP [16]. 
Recently, the attribute-based classifier has evolved into the embedding-based framework, which now prevails due to its simple and effective paradigm [1, 23, 24, 25, 33]. The core of such approaches is to learn a projection from the visual space to the semantic space spanned by class attributes [23, 24], or conversely [33], or to jointly learn an appropriate compatibility function between the visual space and the semantic space [1, 25].
The main disadvantage of the above methods is that the embedding process suffers from semantic loss and the lack of visual training data for unseen classes, thus biasing the prediction towards the seen classes and seriously undermining the performance of models in the GZSL setting. More recently, generative approaches have shown promise for the GZSL setting by generating labeled samples for the seen and unseen classes. [10] synthesize samples by approximating the class conditional distribution of the unseen classes based on learning that of the seen classes. [29, 35] apply GANs to generate visual features conditioned on class descriptions or attributes, which ignore the semantics-consistency constraint and allow the production of synthetic visual features that may be too far from the actual distribution. [7] consider minimizing the L2 norm between real semantics and reconstructed semantics produced by a pre-trained regressor, which is rather weak and unreliable at preserving high-level semantics via the Euclidean distance.
DASCN differs from the above approaches in that it learns the semantics effectively via multi-adversarial learning from both the form and content perspectives. Note that ZSL is also closely related to domain adaptation and image-to-image translation tasks, all of which assume a transfer between source and target domains. Our approach is motivated by, and is similar in spirit to, recent work on synthesizing samples for GZSL [29] and unpaired image-to-image translation [11, 30, 34]. 
DASCN preserves the visual-semantic consistency by employing dual GANs to capture the visual and semantic distributions, respectively.

2.2 Generative Adversarial Networks

As one of the most promising generative models, GANs have achieved a series of impressive results. The idea behind GANs is to learn a generative model that captures an arbitrary data distribution via a min-max training procedure between a generator and a discriminator that work against each other. DCGAN [22] extends GAN by leveraging deep convolution neural networks. InfoGAN [5] maximizes the mutual information between the latent variables and the generator distribution. In our work, to stabilize training behavior and alleviate mode collapse as much as possible, we utilize WGANs [9] as basic models in a dual structure.

3 Methodology

In this section, we first formalize the GZSL task in Section 3.1. Then we present our model and architecture in Section 3.2. We then describe in detail our model's objective, training procedures and generalized zero-shot recognition in Section 3.3, Section 3.4 and Section 3.5, respectively.

3.1 Formulation

We denote by D_Tr = {(x, y, a) | x ∈ X, y ∈ Y^s, a ∈ A} the set of N^s training instances of the seen classes. Note that x ∈ X ⊆ R^K represents K-dimensional visual features extracted from convolution neural networks, Y^s denotes the corresponding class labels, and a ∈ A ⊆ R^L denotes semantic features (e.g., the attributes of seen classes). In addition, we have a disjoint class label set U = {(y, a) | y ∈ Y^u, a ∈ A} of unseen classes, where visual features are missing. Given D_Tr and U, in GZSL, we learn a prediction: X → Y^s ∪ Y^u. 
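As a minimal sketch of the label-space bookkeeping in this formulation (the class names below are hypothetical, chosen only to mirror Figure 1): the seen and unseen label sets are disjoint, training labels come from the seen classes only, and the GZSL prediction space is their union.

```python
# Toy illustration of the GZSL setup: disjoint Y_s / Y_u, inductive training
# data restricted to seen classes, prediction over the union of both sets.
seen_classes = {'lanius', 'oriole'}        # Y_s: classes with training images
unseen_classes = {'geococcyx'}             # Y_u: classes with no visual features
assert seen_classes.isdisjoint(unseen_classes)

train_labels = ['lanius', 'oriole', 'lanius']      # labels occurring in D_Tr
prediction_space = seen_classes | unseen_classes   # GZSL: f maps X into Y_s ∪ Y_u

print(set(train_labels) <= seen_classes)   # True: no unseen class at training time
print(sorted(prediction_space))
```

This is only a sanity check of the set relations, not part of the model itself.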
Note that our method is of the inductive school, where the model has access to neither visual nor semantic information of unseen classes in the training phase.

Figure 2: Network architecture of DASCN. The semantic feature of class c, represented as a_c, and a group of randomly sampled noise vectors are utilized by generator G_SV to synthesize pseudo visual features x_c′. The synthesized visual features are then used by generator G_VS and discriminator D_V simultaneously, to enforce the semantics-consistency constraint from the perspectives of form and content and to distinguish between real visual features x_c and synthesized visual features x_c′. D_S denotes the discriminator that distinguishes between a_c and the reconstructed semantic feature a_c′ generated from the corresponding x_c′. x_c″ are produced by generator G_SV taking a_c′ and sampled noise as input to enforce the visual consistency constraint. Please zoom to view better.

3.2 Model Architecture

Given the training data D_Tr of the seen classes, the primal task of DASCN is to learn a generator G_SV: Z × A → X that takes the random Gaussian noise z ∈ Z and semantic attribute a ∈ A as input to generate the visual feature x′ ∈ X, while the dual task is to train an inverse generator G_VS: X → A. Once the generator G_SV learns to generate visual features of the seen classes conditioned on the seen class-level attributes, it can also generate those of the unseen classes. To realize this, we employ two WGANs, the primal GAN and the dual GAN. The primal GAN consists of the generator G_SV and the discriminator D_V that discriminates between fake visual features generated by G_SV and real visual features. 
Similarly, the dual GAN learns a generator G_VS and a discriminator D_S that distinguishes the fake semantic features generated by G_VS from the real data.
The overall architecture and data flow are illustrated in Figure 2. In the primal GAN, we hallucinate pseudo visual features x_c′ = G_SV(a_c, z) of class c using G_SV based on the corresponding class semantic features a_c, and then feed the real visual features and the synthetic features from G_SV into D_V to be evaluated. To ensure that G_SV generates inter-class discriminative visual features, inspired by the work [29], we design a classifier trained on the real visual features and minimize the classification loss over the generated features. It is formulated as:

L_CLS = −E_{x′∼P_{x′}}[log P(y | x′; θ)]    (1)

where x′ represents the generated visual feature, y is the class label of x′, and the conditional probability P(y | x′; θ) is computed by a linear softmax classifier parameterized by θ.
Following that is one of the main innovations in our work: we guarantee semantics-consistency from both the form and content perspectives thanks to the dual structure. In form, G_SV(a, z) is translated back to the semantic space using G_VS, which outputs a′ = G_VS(G_SV(a, z)) as the reconstruction of a. To achieve the goal that the generated semantic features of each class are distributed around the corresponding true semantic representation, we design the centroid regularization, which regularizes the mean of the generated semantic features of each class to approach the respective real semantic embeddings, so as to maintain semantics-consistency to a large extent. The regularization is formulated as:

L_SC = (1/C) Σ_{c=1}^{C} ‖ E_{a_c′∼P^c_{a′}}[a_c′] − a_c ‖_2    (2)

where C is the number of seen classes, a_c is the semantic feature of class c, P^c_{a′} denotes the conditional distribution of generated semantic features of class c, a_c′ are the generated features of class c, and the centroid is formulated as:

E_{a_c′∼P^c_{a′}}[a_c′] = (1/N^s_c) Σ_{i=1}^{N^s_c} G_VS(G_SV(a_c, z_i))    (3)

where N^s_c is the number of generated semantic features of class c. We employ the centroid regularization to encourage G_VS to reconstruct semantic features of each seen class that statistically match the real features of that class. From the content point of view, the question of how well the pseudo semantic features a′ are reconstructed can be translated into the evaluation of the visual features obtained by G_SV taking a′ as input. 
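The centroid regularization of Eqs. (2)-(3) reduces to a per-class mean followed by an L2 distance. The helper below is an illustrative pure-Python sketch (not the authors' PyTorch implementation): `gen_sem` maps each class index to its batch of reconstructed semantic vectors, and `real_sem` maps it to the real attribute vector.

```python
from math import sqrt

def centroid_sc_loss(gen_sem, real_sem):
    """L_SC (Eq. 2): average over classes of the L2 distance between the
    centroid of reconstructed semantic features (Eq. 3) and the real attribute."""
    total = 0.0
    for c, feats in gen_sem.items():
        dim = len(real_sem[c])
        # empirical centroid E[a_c'] over the N_c^s reconstructions of class c
        centroid = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        total += sqrt(sum((centroid[d] - real_sem[c][d]) ** 2 for d in range(dim)))
    return total / len(gen_sem)

# toy example: two classes with 2-dim attributes
gen = {0: [[1.0, 0.0], [3.0, 0.0]], 1: [[0.0, 2.0], [0.0, 4.0]]}
real = {0: [2.0, 0.0], 1: [0.0, 2.0]}
print(centroid_sc_loss(gen, real))  # centroids (2,0) and (0,3) -> (0 + 1)/2 = 0.5
```

In the actual model the reconstructions would come from G_VS(G_SV(a_c, z_i)); here they are toy numbers.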
Motivated by the observation that visual features have a higher intra-class similarity and a relatively lower inter-class similarity, we introduce the visual consistency constraint:

L_VC = (1/C) Σ_{c=1}^{C} ‖ E_{x_c″∼P^c_{x″}}[x_c″] − E_{x_c∼P^c_x}[x_c] ‖_2    (4)

where x_c denotes the visual features of class c, x_c″ is the pseudo visual feature generated by generator G_SV employing G_VS(G_SV(a_c, z)) as input, P^c_x and P^c_{x″} are the conditional distributions of real and synthetic features respectively, and the centroid of x_c″ is formulated as:

E_{x_c″∼P^c_{x″}}[x_c″] = (1/N^s_c) Σ_{i=1}^{N^s_c} G_SV( G_VS(G_SV(a_c, z_i)), z_i′ )    (5)

It is worth noting that our model is constrained in terms of both the form and content aspects to achieve the goal of retaining semantics-consistency, and it achieves superior results in extensive experiments.

3.3 Objective

Given the issue that the Jensen-Shannon divergence optimized by the traditional GAN leads to instability during training, our model is based on two WGANs that leverage the Wasserstein distance between two distributions as the objectives. The corresponding loss functions used in the primal GAN are defined as follows. First,

L_DV = E_{x′∼P_{x′}}[D_V(x′, a)] − E_{x∼P_data}[D_V(x, a)] + λ_1 E_{x̂∼P_x̂}[ ( ‖∇_x̂ D_V(x̂, a)‖_2 − 1 )^2 ]    (6)

where x̂ = αx + (1 − α)x′ with α ∼ U(0, 1), λ_1 is the penalty coefficient, the first two terms approximate the Wasserstein distance between the distributions of fake features and real features, and the third term is the gradient penalty. 
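Once the critic outputs and the interpolate gradient norms for a batch are in hand, the critic objective of Eq. (6) is plain batch-mean arithmetic. The helper below is an illustrative sketch under that assumption (it does not compute the gradients themselves, which in the paper's PyTorch setting would come from autograd):

```python
def critic_loss(d_fake, d_real, grad_norms, lam1=10.0):
    """L_DV (Eq. 6): mean D(fake) - mean D(real) + lam1 * mean (||grad||_2 - 1)^2,
    with expectations replaced by batch means. lam1 = 10 follows the WGAN-GP
    setting the paper adopts for lambda_1."""
    mean = lambda xs: sum(xs) / len(xs)
    gradient_penalty = mean([(g - 1.0) ** 2 for g in grad_norms])
    return mean(d_fake) - mean(d_real) + lam1 * gradient_penalty

# gradients already on the unit sphere -> zero penalty; loss is the Wasserstein gap
print(critic_loss([0.2, 0.4], [1.0, 2.0], [1.0, 1.0]))  # 0.3 - 1.5 + 0 = -1.2
```

The dual critic loss L_DS of Eq. (8) has the same shape with semantic features and λ_4 in place of visual features and λ_1.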
Second, the loss function of the generator of the primal GAN is formulated as:

L_GSV = −E_{x′∼P_{x′}}[D_V(x′, a)] − E_{a′∼P_{a′}}[D_V(x″, a′)] + λ_2 L_CLS + λ_3 L_VC    (7)

where the first two terms are the Wasserstein loss, the third term is the classification loss corresponding to class labels, the fourth term is the visual consistency constraint introduced before, and λ_1, λ_2, λ_3 are hyper-parameters.
Similarly, the loss functions of the dual GAN are formulated as:

L_DS = E_{a′∼P_{a′}}[D_S(a′)] − E_{a∼P_a}[D_S(a)] + λ_4 E_{ŷ∼P_ŷ}[ ( ‖∇_ŷ D_S(ŷ)‖_2 − 1 )^2 ]    (8)

L_GVS = −E_{a′∼P_{a′}}[D_S(a′)] + λ_5 L_SC + λ_6 L_VC    (9)

In Eq. (8) and Eq. (9), ŷ = βa + (1 − β)a′ is the linear interpolation of the real semantic feature a and the fake a′, and λ_4, λ_5, λ_6 are hyper-parameters weighting the constraints.

Table 1: Datasets used in our experiments, and their statistics

Dataset | Semantics/Dim | # Image | # Seen Classes | # Unseen Classes
CUB     | A/312         | 11788   | 150            | 50
SUN     | A/102         | 14340   | 645            | 72
AWA1    | A/85          | 30475   | 40             | 10
aPY     | A/64          | 15339   | 20             | 12

3.4 Training Procedure

We train the discriminators to judge features as real or fake and optimize the generators to fool the discriminators. To optimize the DASCN model, we follow the training procedure proposed in WGAN [9]. The training procedure of our framework is summarized in Algorithm 1. In each iteration, the discriminators D_V, D_S are optimized for n_1, n_2 steps using the loss introduced in Eq. 
(6) and Eq. (8), respectively, and then one step is taken on the generators with Eq. (7) and Eq. (9) after the discriminators have been trained. According to [30], such a procedure enables the discriminators to provide more reliable gradient information. The training of traditional GANs suffers from the issue that the sigmoid cross-entropy saturates locally as the discriminator improves, which may lead to vanishing gradients and requires carefully balancing the discriminator and the generator. Compared to traditional GANs, the Wasserstein distance is differentiable almost everywhere and demonstrates its capability of mitigating mode collapse. We put the detailed algorithm for training the DASCN model in the supplemental material.

3.5 Generalized Zero-Shot Recognition

With the well-trained generative model, we can elegantly generate labeled exemplars of any class by feeding the unstructured component z, resampled from random Gaussian noise, and the class semantic attribute a_c into G_SV. An arbitrary number of visual features can be synthesized, and those exemplars are finally used to train any off-the-shelf classification model. For simplicity, we adopt a softmax classifier. Finally, the prediction function for an input test visual feature v is:

f(v) = arg max_{y ∈ Ỹ} P(y | v; θ′)    (10)

where Ỹ = Y^s ∪ Y^u for GZSL.

4 Experiments

4.1 Datasets and Evaluation Metrics

To test the effectiveness of the proposed model for GZSL, we conduct extensive evaluations on four benchmark datasets: CUB [27], SUN [21], AWA1 [15], aPY [6], and compare the results with state-of-the-art approaches. Statistics of the datasets are presented in Table 1. 
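The recognition rule of Eq. (10) in Section 3.5 can be sketched as follows; since softmax is monotonic, taking the argmax of the classifier scores over Y_s ∪ Y_u suffices. The class names are illustrative, and the linear classifier itself is omitted, only its output scores appear:

```python
from math import exp

def gzsl_predict(logits, labels):
    """Eq. (10): f(v) = argmax over Y_s ∪ Y_u of P(y | v; theta').
    Converts classifier scores for one test feature v into softmax
    probabilities and returns the top label with its probability."""
    exps = [exp(s) for s in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]

# joint label set mixes seen and unseen classes (names are made up)
label, prob = gzsl_predict([0.1, 2.0, -1.0], ['lanius', 'geococcyx', 'oriole'])
print(label)  # geococcyx
```

In the paper's setting, `logits` would be the softmax classifier trained on real seen-class features plus synthesized unseen-class features.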
For all datasets, we extract 2048-dimensional visual features from the entire images via the 101-layered ResNet, the same as [29]. For fair comparison, we follow the training/validation/testing split described in [28].
At test time, in the GZSL setting, the search space includes both the seen and unseen classes, i.e., Y^u ∪ Y^s. To evaluate the GZSL performance over all classes, the following measures are applied. (1) ts: average per-class classification accuracy on test images from the unseen classes with the prediction label set being Y^u ∪ Y^s. (2) tr: average per-class classification accuracy on test images from the seen classes with the prediction label set being Y^u ∪ Y^s. (3) H: the harmonic mean of the above-defined tr and ts, formulated as H = (2 × ts × tr)/(ts + tr), which quantifies the aggregate performance across both seen and unseen test classes. We hope that our model is of high accuracy on both seen and unseen classes.

4.2 Implementation Details

Our implementation is based on PyTorch. DASCN consists of two generators and two discriminators: G_SV, G_VS, D_V, D_S. We train specific models with appropriate hyper-parameters. Due to the space

Table 2: Evaluations on four benchmark datasets. 
*indicates that Cycle-WGAN employs 1024-dim per-class sentences as class semantics rather than 312-dim per-class attributes on CUB, whose results on CUB may not be directly comparable with others.

Method           | AWA1 ts/tr/H   | SUN ts/tr/H    | CUB ts/tr/H    | aPY ts/tr/H
CMT [24]         | 0.9/87.6/1.8   | 8.1/21.8/11.8  | 7.2/49.8/12.6  | 1.4/85.2/2.8
DEVISE [8]       | 13.4/68.7/22.4 | 16.9/27.4/20.9 | 23.8/53.0/32.8 | 4.9/76.9/9.2
ESZSL [23]       | 6.6/75.6/12.1  | 11.0/27.9/15.8 | 12.6/63.8/21.0 | 2.4/70.1/4.6
SJE [1]          | 11.3/74.6/19.6 | 14.7/30.5/19.8 | 23.5/59.2/33.6 | 3.7/55.7/6.9
SAE [13]         | 1.8/77.1/3.5   | 8.8/18.0/11.8  | 7.8/54.0/13.6  | 0.4/80.9/0.9
LESAE [18]       | 19.1/70.2/30.0 | 21.9/34.7/26.9 | 24.3/53.0/33.3 | 12.7/56.1/20.1
SP-AEN [4]       | -              | 24.9/38.2/30.3 | 34.7/70.6/46.6 | 13.7/63.4/22.6
RN [25]          | 31.4/91.3/46.7 | -              | 38.1/61.1/47.0 | -
TRIPLE [31]      | 27.0/67.9/38.6 | 22.2/38.3/28.1 | 26.5/62.3/37.2 | -
f-CLSWGAN [29]   | 57.9/61.4/59.6 | 42.6/36.6/39.4 | 43.7/57.7/49.7 | -
KERNEL [32]      | 18.3/79.3/29.8 | 19.8/29.1/23.6 | 19.9/52.5/28.9 | 11.9/76.3/20.5
PSR [2]          | -              | 20.8/37.2/26.7 | 24.6/54.3/33.9 | 13.5/51.4/21.4
DCN [17]         | 25.5/84.2/39.1 | 25.5/37.0/30.2 | 28.4/60.7/38.7 | 14.2/75.0/23.9
SE-GZSL [14]     | 56.3/67.8/61.5 | 40.9/30.5/34.9 | 41.5/53.3/46.7 | -
GAZSL [35]       | 25.7/82.0/39.2 | 21.7/34.5/26.7 | 23.9/60.6/34.3 | 14.2/78.6/24.1
DASCN (Ours)     | 59.3/68.0/63.4 | 42.4/38.5/40.3 | 45.9/59.0/51.6 | 39.7/59.5/47.6
Cycle-WGAN* [7]  | 56.4/63.5/59.7 | 48.3/33.1/39.2 | 46.0/60.3/52.2 | -

limitation, here we take CUB as an example. Both the generators and discriminators are MLPs with LeakyReLU activation. 
In the primal GAN, G_SV has a single hidden layer containing 4096 nodes and an output layer with a ReLU activation and 2048 nodes, while the discriminator D_V contains a single hidden layer with 4096 nodes and an output layer without activation. G_VS and D_S in the dual GAN have architectures similar to G_SV and D_V, respectively. We use λ_1 = λ_4 = 10 as suggested in [9]. For the loss term contributions, we cross-validate and set λ_2 = λ_3 = λ_6 = 0.01 and λ_5 = 0.1. We choose the noise z with the same dimensionality as the class embedding. Our model is optimized by Adam [12] with a base learning rate of 1e-4.

4.3 Compared Methods and Experimental Results

We compare DASCN with state-of-the-art GZSL models. These approaches fall into two categories. (1) Embedding-based approaches: CMT [24], DEVISE [8], ESZSL [23], SJE [1], SAE [13], LESAE [18], SP-AEN [4], RN [25], KERNEL [32], PSR [2], DCN [17], TRIPLE [31]. This category suffers from the issue of bias towards the seen classes due to the lack of instances of the unseen classes. (2) Generative approaches: SE-GZSL [14], GAZSL [35], f-CLSWGAN [29], Cycle-WGAN [7]. This category synthesizes visual features of the seen and unseen classes and performs better for GZSL than the embedding-based methods.
Table 2 summarizes the performance of all the compared methods under three evaluation metrics on the four benchmark datasets, which demonstrates that on all datasets our DASCN model significantly improves the ts measure and the H measure over the state-of-the-art. Note that Cycle-WGAN [7] employs per-class sentences as class semantic features on the CUB dataset rather than the per-class attributes commonly used by the other comparison methods, so its results on CUB may not be directly comparable with the others. On CUB, DASCN achieves 45.9% in ts and 51.6% in H, improving over the state-of-the-art by 2.2% and 1.9%, respectively. 
On SUN, it obtains 42.4% in the ts measure and 40.3% in the H measure. On AWA1, our model outperforms the runner-up by a considerable gap of 1.9% in the H measure. On aPY, DASCN achieves improvements over the other best competitors of 25.5% in the ts measure and 23.5% in the H measure, which is very impressive. The performance boost is attributed to the effectiveness of DASCN in imitating discriminative visual features of the unseen classes. In conclusion, our model DASCN achieves a great balance between seen- and unseen-class classification and consistently outperforms the current state-of-the-art methods for GZSL.

Table 3: Comparison between the reported results of Cycle-WGAN and our model. * indicates employing the same semantic features (per-class sentences (stc)) as Cycle-WGAN on CUB.

Method      | FLO ts/tr/H    | CUB* ts/tr/H   | SUN ts/tr/H    | AWA1 ts/tr/H
Cycle-WGAN  | 59.1/71.1/64.5 | 46.0/60.3/52.2 | 48.3/33.1/39.2 | 56.4/63.5/59.7
DASCN       | 60.5/80.4/69.0 | 47.4/60.1/53.0 | 42.4/38.5/40.3 | 59.3/68.0/63.4

To further clarify the advantages of DASCN over Cycle-WGAN [7] in both methodology and empirical results, we conduct the following experiments: (1) we use the same semantic features (per-class sentences (stc)) as Cycle-WGAN does for DASCN on the CUB dataset, and (2) we add the FLO [6] dataset employed by Cycle-WGAN as a benchmark. As shown in Table 3, the results on the four benchmarks consistently demonstrate the superiority of DASCN. The main novelty of our work is the integration of the dual-structure mechanism and visual-semantic consistencies into GANs for bidirectional alignment, alleviating semantic loss. 
In contrast, Cycle-WGAN consists of only one GAN and a pre-trained regressor, which merely minimizes the L2 norm between the reconstructed and real semantics. As a result, Cycle-WGAN is weak and unreliable at preserving high-level semantics through the Euclidean distance alone. In comparison, thanks to the dual-GAN structure and the visual-semantic consistency losses, DASCN explicitly enforces the generated features to have a highly discriminative semantic nature at the high level and effectively preserves semantics via multi-adversarial learning from both the form and content perspectives.
More specifically, we build two GANs for visual and semantic generation, and design two consistency regularizations accordingly: (1) a semantic consistency that aligns the centroids of the synthetic and real semantics, and (2) a visual consistency that not only matches the real visual features but also enforces the synthetic semantics to be highly discriminative so as to further generate effective visual features. Compared to Cycle-WGAN, which only minimizes the L2 norm between the reconstructed and real semantics, the novelty introduced here is the tailor-made high-level semantic consistency at a finer granularity.
Note that we not only generate synthetic semantic features from the synthetic visual features, but also further generate synthetic visual features from those synthetic semantic features, constrained by the visual consistency loss to ensure that the generated features have a highly discriminative semantic nature. Such bidirectional synthesis procedures collaboratively boost the quality of the synthesized instances via the dual structure.

4.4 Ablation Study

We now conduct an ablation study to evaluate the effects of the dual structure, the semantic centroid regularization LSC, and the visual consistency constraint LVC.
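The two regularizers just described can be summarized in a few lines. The forms below (squared L2 on per-class centroids for LSC, mean squared error on regenerated features for LVC) are an illustrative simplification of the paper's full objective; the exact weighting and per-class pairing are our assumptions:

```python
import numpy as np

def semantic_centroid_loss(recon_sem, real_class_sem):
    """LSC sketch: squared L2 distance between the centroid of the semantics
    reconstructed for one class and that class's real semantic vector."""
    return float(np.sum((recon_sem.mean(axis=0) - real_class_sem) ** 2))

def visual_consistency_loss(regen_vis, real_vis):
    """LVC sketch: penalize mismatch between visual features regenerated
    from synthetic semantics and the corresponding real visual features."""
    return float(np.mean((regen_vis - real_vis) ** 2))

rng = np.random.default_rng(1)
recon = rng.random((64, 85))      # 64 reconstructed semantic vectors for one class
centroid = recon.mean(axis=0)
print(semantic_centroid_loss(recon, centroid))                 # 0.0: centroids aligned
print(visual_consistency_loss(np.ones(2048), np.zeros(2048)))  # 1.0
```

Matching centroids rather than individual vectors is what lets LSC tolerate intra-class variation in the reconstructions while still anchoring each class to its prior semantic knowledge.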
We take the single-WGAN model f-CLSWGAN as the baseline, and train three variants of our model: the dual structure alone, and the dual structure with only the semantic or only the visual constraint, denoted Dual-WGAN, Dual-WGAN +LSC, and Dual-WGAN +LVC, respectively. Table 4 shows the performance of each setting. Compared with the full DASCN, the H measure of Dual-WGAN alone drops drastically, by 4.9% on aPY, 1.4% on AWA1, 1.3% on CUB, and 0.7% on SUN. This clearly highlights the importance of the designed semantic and visual constraints in providing explicit supervision to our model. When lacking the semantic or the visual constraint, respectively, our model drops by 1.3% and 3.6% on aPY, while on AWA1 the gaps are 0.9% and 0.7%. In general, the three variants of our proposed model offer superior and more balanced performance than the baseline. DASCN incorporates the dual structure, the semantic centroid regularization, and the visual consistency constraint into a unified framework and achieves the best results, which demonstrates that the different components promote each other and work together to improve the performance of DASCN significantly.

4.5 Quality of Synthesized Samples

We perform an experiment to gain further insight into the quality of the generated samples, a key issue for our approach, although the quantitative GZSL results reported above already demonstrate that the samples synthesized by our model are highly effective for the GZSL task.
Specifically, we randomly sample three unseen categories from aPY and visualize both the true visual features and the synthesized visual features using t-SNE [20].

Table 4: Effects of different components on four benchmark datasets under the GZSL setting.

Methods         |  aPY             |  AWA1            |  CUB             |  SUN
                |  ts   tr   H     |  ts   tr   H     |  ts   tr   H     |  ts   tr   H
WGAN-baseline   |  32.4 57.5 41.4  |  57.9 61.4 59.6  |  43.7 57.7 49.7  |  42.6 36.6 39.4
Dual-WGAN       |  34.1 57.0 42.7  |  57.5 67.4 62.0  |  44.5 57.9 50.3  |  42.7 36.9 39.6
Dual-WGAN +LSC  |  35.4 58.2 44.0  |  57.7 68.6 62.7  |  44.9 58.5 50.8  |  42.9 37.3 39.9
Dual-WGAN +LVC  |  36.7 62.0 46.3  |  58.3 67.3 62.5  |  45.2 59.1 51.2  |  43.5 36.5 39.7
DASCN           |  39.7 59.5 47.6  |  59.3 68.0 63.4  |  45.9 59.0 51.6  |  42.4 38.5 40.3

Figure 3: (a): t-SNE visualization of the real and synthesized visual feature distributions for three randomly selected unseen classes; (b, c): harmonic mean H as the number of samples generated by DASCN and its variants increases. DASCN w/o SC denotes DASCN without the semantic consistency constraint, and DASCN w/o VC denotes DASCN without the visual consistency constraint.

Figure 3(a) depicts the empirical distributions of the true visual features and the synthesized visual features. We observe clear patterns of intra-class diversity and inter-class separability in the figure. This intuitively demonstrates not only that the synthesized feature distributions approximate the true distributions well, but also that our model endows the synthesized features with a high discriminative power.
Finally, we evaluate how the number of generated samples per class affects the performance of DASCN and its variants.
As shown in Figures 3(b) and 3(c), H increases with the number of synthesized samples and then levels off, and DASCN with visual-semantic interactions achieves better performance in all circumstances, which further validates the superiority and rationality of the different components of our model.

5 Conclusion

We propose DASCN, a novel generative model for GZSL, to address the challenging setting in which existing GZSL approaches either suffer from semantic loss or cannot guarantee the visual-semantic interactions. DASCN synthesizes inter-class discriminative and semantics-preserving visual features for both seen and unseen classes. The DASCN architecture is novel in that it consists of a primal GAN and a dual GAN that collaboratively promote each other, capturing the underlying data structures of both the visual and semantic representations. Thus, our model can effectively enhance the knowledge transfer from the seen categories to the unseen ones, and can effectively alleviate the inherent semantic loss problem in GZSL. We conduct extensive experiments on four benchmark datasets and compare our model against the state-of-the-art models. The evaluation results consistently demonstrate the superiority of DASCN over state-of-the-art GZSL models.

Acknowledgments

This research is supported in part by the National Key Research and Development Project (Grant No. 2017YFC0820503), the National Science and Technology Major Project for IND (investigational new drug) (Project No. 2018ZX09201014), and the CETC Joint Advanced Research Foundation (Grant No.
6141B08080101, 6141B08010102).

References

[1] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2927–2936, 2015.

[2] Yashas Annadani and Soma Biswas. Preserving semantic relations for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7603–7612, 2018.

[3] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In European Conference on Computer Vision, pages 52–68. Springer, 2016.

[4] Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, and Shih-Fu Chang. Zero-shot visual recognition using semantics-preserving adversarial embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1043–1052, 2018.

[5] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets.
In Advances in neural information processing systems, pages 2172–2180, 2016.

[6] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785. IEEE, 2009.

[7] Rafael Felix, Vijay BG Kumar, Ian Reid, and Gustavo Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2018.

[8] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.

[9] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.

[10] Yuchen Guo, Guiguang Ding, Jungong Han, and Yue Gao. Synthesizing samples for zero-shot learning. IJCAI, 2017.

[11] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1857–1865. JMLR.org, 2017.

[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3174–3183, 2017.

[14] Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. Generalized zero-shot learning via synthesized examples.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4281–4289, 2018.

[15] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.

[16] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2013.

[17] Shichen Liu, Mingsheng Long, Jianmin Wang, and Michael I Jordan. Generalized zero-shot learning with deep calibration network. In Advances in Neural Information Processing Systems, pages 2005–2015, 2018.

[18] Yang Liu, Quanxue Gao, Jin Li, Jungong Han, and Ling Shao. Zero shot learning via low-rank embedded semantic autoencoder. In IJCAI, pages 2490–2496, 2018.

[19] Yang Long, Li Liu, Ling Shao, Fumin Shen, Guiguang Ding, and Jungong Han. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1627–1636, 2017.

[20] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.

[21] Genevieve Patterson and James Hays. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758.
IEEE, 2012.

[22] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[23] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.

[24] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pages 935–943, 2013.

[25] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.

[26] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6857–6866, 2018.

[27] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. 2010.

[28] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE transactions on pattern analysis and machine intelligence, 2018.

[29] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5542–5551, 2018.

[30] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, pages 2849–2857, 2017.

[31] Haofeng Zhang, Yang Long, Yu Guan, and Ling Shao.
Triple verification network for generalized zero-shot learning. IEEE Transactions on Image Processing, 28(1):506–517, 2018.

[32] Hongguang Zhang and Piotr Koniusz. Zero-shot kernel learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7670–7679, 2018.

[33] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2021–2030, 2017.

[34] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.

[35] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1004–1013, 2018.