{"title": "Zero-Shot Semantic Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 468, "page_last": 479, "abstract": "Semantic segmentation models are limited in their ability to scale to large numbers of object classes. In this paper, we introduce the new task of zero-shot semantic segmentation: learning pixel-wise classifiers for never-seen object categories with zero training examples. To this end, we present a novel architecture, ZS3Net, combining a deep visual  segmentation model with an approach to generate visual representations from semantic word embeddings. By this way, ZS3Net addresses pixel classification tasks where both seen and unseen categories are faced at test time (so called generalized zero-shot classification). Performance is further improved by a self-training step that relies on automatic pseudo-labeling of pixels from unseen classes. On the two standard segmentation datasets, Pascal-VOC and Pascal-Context, we propose zero-shot benchmarks and set competitive baselines. For complex scenes as ones in the Pascal-Context dataset, we extend our approach by using a graph-context encoding to fully leverage spatial context priors coming from class-wise segmentation maps.", "full_text": "Zero-Shot Semantic Segmentation\n\nMaxime Bucher\n\nvaleo.ai\n\nmaxime.bucher@valeo.com\n\nMatthieu Cord\n\nSorbonne Universit\u00e9\n\nvaleo.ai\n\nmatthieu.cord@lip6.fr\n\nTuan-Hung Vu\n\nvaleo.ai\n\ntuan-hung.vu@valeo.com\n\nPatrick P\u00e9rez\n\nvaleo.ai\n\npatrick.perez@valeo.com\n\nAbstract\n\nSemantic segmentation models are limited in their ability to scale to large numbers\nof object classes. In this paper, we introduce the new task of zero-shot semantic\nsegmentation: learning pixel-wise classi\ufb01ers for never-seen object categories with\nzero training examples. 
To this end, we present a novel architecture, ZS3Net, combining a deep visual segmentation model with an approach to generate visual representations from semantic word embeddings. In this way, ZS3Net addresses pixel classification tasks where both seen and unseen categories are faced at test time (so-called "generalized" zero-shot classification). Performance is further improved by a self-training step that relies on automatic pseudo-labeling of pixels from unseen classes. On the two standard segmentation datasets Pascal-VOC and Pascal-Context, we propose zero-shot benchmarks and set competitive baselines. For complex scenes such as those in the Pascal-Context dataset, we extend our approach with a graph-context encoding to fully leverage spatial context priors coming from class-wise segmentation maps.

1 Introduction

Semantic segmentation has achieved great progress using convolutional neural networks (CNNs). Early CNN-based approaches classify region proposals to generate segmentation predictions [17]. FCN [28] was the first framework adopting fully convolutional networks to address the task in an end-to-end manner. Most recent state-of-the-art models like UNet [37], SegNet [3], DeepLab [10, 11] and PSPNet [50] are FCN-based. An effective strategy for semantic segmentation is to augment CNN features with contextual information, e.g. using atrous/dilated convolutions [10, 46], pyramid context pooling [50] or a context encoding module [47].
Segmentation approaches are mainly supervised, but there is an increasing interest in weakly-supervised segmentation models using annotations at the image level [33, 34] or box level [13]. We propose in this paper to investigate a complementary learning problem where part of the classes are missing altogether during training. Our goal is to re-engineer existing recognition architectures to effortlessly accommodate these never-seen, a.k.a.
unseen, categories of scenes and objects. No manual annotations or real samples of these categories are needed during training, only their labels. This line of work is usually called zero-shot learning (ZSL).
ZSL for image classification has been actively studied in recent years. Early approaches address it as an embedding problem [1, 2, 6, 8, 16, 24, 32, 36, 39, 42, 43, 48]. They learn how to map image data and class descriptions into a common space where semantic similarity translates into spatial proximity. There are different variants in the literature on how the projections or the similarity measure are computed: simple linear projections [1, 2, 6, 16, 24, 36], non-linear multi-modal embeddings [39, 43] or even hybrid methods [8, 32, 42, 48].
Recently, [7, 25, 45] proposed to generate synthetic instances of unseen classes by training a conditional generator from the seen classes. There are very few extensions of zero-shot learning to tasks other than classification. Very recently, [4, 14, 35, 51] tackled object detection: unseen objects are detected while box annotations are only available for seen classes.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Introducing and addressing zero-shot semantic segmentation. Panels: input image; segmentation without zero-shot learning; our ZS3Net segmentation. In this example, there are no 'motorbike' examples in the training set. As a consequence, a supervised model (middle) fails on this object, seeing it as a mix of the seen classes person, bicycle and background. With the proposed ZS3Net method (right), pixels of the never-seen motorbike class are recognized.
As far as we know, there is no approach considering the zero-shot setting for segmentation.
In this paper, we introduce the new task of zero-shot semantic segmentation (ZS3) and propose an architecture, called ZS3Net, to address it. Inspired by the most recent zero-shot classification approaches, we combine a backbone deep net for image embedding with a generative model of class-dependent features. This allows the generation of visual samples from unseen classes, which are then used, together with real visual samples from seen classes, to train our final classifier. Figure 1 illustrates the potential of our approach for visual segmentation. We also propose a novel self-training step in a relaxed zero-shot setup where unlabelled pixels from unseen classes are already available at training time. The whole zero-shot pipeline including self-training is coined ZS5Net (ZS3Net with Self-Supervision) in this work.
Lastly, we further extend our model by exploiting contextual cues from spatial region relationships. This strategy is motivated by the fact that similar objects not only share similar properties but also similar contexts. For example, 'cow' and 'horse' are often seen in fields while most 'motorbike' and 'bicycle' instances appear in urban scenes.
We report evaluations of ZS3Net on two datasets (Pascal-VOC and Pascal-Context) and in zero-shot setups with varying numbers of unseen classes. Compared to a ZSL baseline, our method delivers excellent performance, which is further boosted using self-training and semantic contextual cues.

2 Zero-shot semantic segmentation

2.1 Introduction to our strategy

Zero-shot learning addresses recognition problems where not all the classes are represented in the training examples.
This is made possible by using a high-level description of the categories that helps relate the new classes (the unseen classes) to classes for which training examples are available (the seen classes). Learning is usually done by leveraging an intermediate level of representation, which provides semantic information about the categories to classify.
A common idea is to transfer semantic similarities between linguistic entities from some suitable text embedding space to a visual representation space. Effectively, classes like 'zebra' and 'donkey' that share many semantic attributes are likely to lie closer in the representation space than very different classes, 'bird' and 'television' for instance. Such a joint visual-text perspective enables statistical training of zero-shot recognition models.
We address in this work the problem of learning a semantic segmentation network capable of discriminating between a given set of classes when training images are only available for a subset of them. To this end, we start from an existing semantic segmentation model trained with a supervised loss on seen data (DeepLabv3+ in Fig. 2). This model is limited to the trained categories and, hence, unable to recognize new unseen classes. In Figure 1-(middle) for instance, the person (a seen class) is

Figure 2: Our deep ZS3Net for zero-shot semantic segmentation. The figure is separated by colors into two parts corresponding to: (1) training the generative model and (2) fine-tuning the classification layer. In (1), the generator, conditioned on the word2vec (w2v) embedding of seen classes' labels, learns to generate synthetic features that match real DeepLab features on seen classes. Later in (2), the classifier is trained to classify real features from seen classes and synthetic ones from unseen classes.
At run-time, the classifier operates on real DeepLab features stemming from both types of classes.

correctly segmented unlike the motorbike (an unseen class) whose pixels are wrongly predicted as a mix of 'person', 'bicycle' and 'background'.
To allow the semantic segmentation model to recognize both seen and unseen categories, we propose to generate synthetic training data for unseen classes. This is obtained with a generative model conditioned on the semantic representation of the target classes. This generator outputs the pixel-level multi-dimensional features that the segmentation model relies on (blue zone 1 in Fig. 2).
Once the generator is trained, many synthetic features can be produced for unseen classes and combined with real samples from seen classes. This new set of training data is used to retrain the classifier of the segmentation network (orange zone 2 in Fig. 2) so that it can now handle both seen and unseen classes. At test time, an image is passed through the semantic segmentation model equipped with the retrained classification layer, allowing prediction for both seen and unseen classes. Figure 1-(right) shows a result after this procedure, with the model now able to delineate correctly the motorbike category.

2.2 Architecture and learning of ZS3Net

We denote the set of all classes as C = S ∪ U, with S the set of seen classes and U the set of unseen ones, where S ∩ U = ∅. Each category c ∈ C can be mapped through word embedding to a vector representation a[c] ∈ R^{d_a} of dimension d_a. To this end, we use in the experiments the 'word2vec' model [30] learned on a dump of the Wikipedia corpus (approx. 3 billion words). This popular embedding is based on the skip-gram language model, which is learned through predicting the context (nearby words) of words in the target dictionary.
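This proximity structure in the embedding space is what the method relies on; it can be illustrated with cosine similarity. The 3-d vectors below are purely hypothetical stand-ins for the 300-d word2vec embeddings a[c] (values are illustrative only):

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d stand-ins for the 300-d word2vec vectors a[c] (illustrative values only).
a = {
    "zebra":      np.array([0.9, 0.8, 0.1]),
    "donkey":     np.array([0.8, 0.9, 0.2]),
    "television": np.array([0.1, 0.2, 0.9]),
}

# Semantically related classes end up closer than unrelated ones.
assert cosine(a["zebra"], a["donkey"]) > cosine(a["zebra"], a["television"])
```

With real word2vec vectors the same comparison would be made on the 300-d embeddings learned from the Wikipedia corpus.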
As a result of this training strategy, words that frequently share common contexts in the corpus are located in close proximity in the embedded vector space. In other words, this semantic representation is expected to capture geometrically the semantic relationships between the classes of interest.
In FCN-based segmentation frameworks, the input images are forwarded through an encoding stack of fully convolutional layers, which results in smaller feature maps compared to the original resolution. Effectively, logit prediction maps have small resolutions and often require an additional up-sampling step to match the input size [10, 11, 28]. In the current context, with a slight abuse of notation, we attach each spatial location on the encoded feature map to a pixel of the input image down-sampled to a comparable size. We can therefore assign class labels to encoded features and construct training data in the feature space. From now on, 'pixel' will refer to pixel locations in this down-sampled image.
Definition and collection of pixel-wise data (step 0). We start from the DeepLabv3+ semantic segmentation model [11], pre-trained with a supervised loss on annotated data from seen classes. Based on this architecture, we need to choose suitable features, out of several feature maps that can be used independently for classification. Conversely, the classifier to be later fine-tuned must be able to operate on individual pixel-wise features.
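The pixel-wise data collection just described, attaching a class label to each location of the M × N encoded feature map, can be sketched as follows. The toy shapes and the nearest-neighbour down-sampling choice are assumptions, not details specified by the paper:

```python
import numpy as np

M, N, dx = 4, 4, 8           # encoded feature-map resolution and feature dimension (toy values)
H, W = 16, 16                # full input resolution (toy values)

feat = np.random.randn(M, N, dx)               # DeepLab-style encoded features
labels_full = np.random.randint(0, 5, (H, W))  # ground-truth segmentation at full resolution

# Nearest-neighbour down-sampling of the label map to the feature resolution.
rows = np.linspace(0, H - 1, M).astype(int)
cols = np.linspace(0, W - 1, N).astype(int)
labels = labels_full[np.ix_(rows, cols)]       # shape (M, N)

# Flatten into per-pixel training pairs (feature vector, class label).
X = feat.reshape(-1, dx)                       # (M*N, dx)
y = labels.reshape(-1)                         # (M*N,)
assert X.shape == (M * N, dx) and y.shape == (M * N,)
```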
As a result, we choose the last 1 × 1 convolutional classification layer of DeepLabv3+ and the features it ingests as our classifier f and targeted features x respectively.
Let the training set D^s = {(x_i^s, y_i^s, a_i^s)} be composed of triplets where x_i^s ∈ R^{M×N×d_x} is a d_x-dimensional feature map, y_i^s ∈ S^{M×N} is the associated ground-truth segmentation map and a_i^s ∈ R^{M×N×d_a} is the class embedding map that associates to each pixel the semantic embedding of its class. We note that M × N is the resolution of the encoded feature maps, as well as of the down-sampled image and segmentation ground-truth.¹ For the K = |U| unseen classes, no training data is available, only the category embeddings a[c], c ∈ U.
On seen data, the DeepLabv3+ model is trained with full supervision using the standard cross-entropy loss. After this training phase, we remove the last classification layer and only use the remaining network for extracting seen features, as illustrated in the blue part (1) of Figure 2. To avoid supervision leakage from unseen classes [44], we retrain the backbone network on ImageNet [38] on seen classes solely. Details are given later in Section 3.1.
Generative model (step 1). Key to our approach is the ability to generate image features conditioned on a class embedding vector, without access to any images of this class. Given a random sample z from a fixed multivariate Gaussian distribution and the semantic description a, new pixel features will be generated as x̂ = G(a, z; w) ∈ R^{d_x}, where G is a trainable generator with parameters w.
Toward this goal, we can leverage any generative approach like GAN [18], GMMN [27] or VAE [22]. This feature generator is trained under the supervision of features from seen classes. We follow [7] and adopt the "Generative Moment Matching Network" (GMMN) [27] for the feature generator.
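A moment-matching objective of this kind can be sketched as follows: a minimal NumPy version of Gaussian-kernel maximum mean discrepancy between a real and a generated feature population. The shapes and the single bandwidth are assumptions (the paper combines several bandwidths), and the means here give a normalized variant of the summed criterion:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=3.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd(X, X_hat, sigma=3.0):
    # Maximum mean discrepancy between real features X and generated ones X_hat:
    # small when the two populations are distributed alike.
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(X_hat, X_hat, sigma).mean()
            - 2 * gaussian_kernel(X, X_hat, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (64, 8))      # "real" pixel features (toy)
close = rng.normal(0.0, 1.0, (64, 8))  # samples from the same distribution
far = rng.normal(3.0, 1.0, (64, 8))    # samples from a shifted distribution
assert mmd(X, close) < mmd(X, far)     # matching distributions give lower MMD
```

Minimizing such a criterion with respect to the generator parameters w pushes the generated population toward the real one.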
GMMN is a parametric random generative process G using a differentiable criterion to compare the target data distribution and the generated one. The generative process is considered good if, for each semantic description a, two random populations X(a) ⊂ R^{d_x} drawn from D^s and X̂(a; w) sampled with the generator have a low maximum mean discrepancy (a classic divergence measure between two probability distributions):

L_GMMN(a) = Σ_{x,x′ ∈ X(a)} k(x, x′) + Σ_{x̂,x̂′ ∈ X̂(a;w)} k(x̂, x̂′) − 2 Σ_{x ∈ X(a)} Σ_{x̂ ∈ X̂(a;w)} k(x, x̂),

where k is a kernel that we choose as Gaussian, k(x, x′) = exp(−(1/2σ²) ‖x − x′‖²), with bandwidth parameter σ. The parameters w of the generative network are optimized by Stochastic Gradient Descent [5].
Classification model (step 2). Similar to DeepLabv3+, the classification layer f consists of a 1 × 1 convolutional layer. Once G is trained in step 1, arbitrarily many pixel-level features can be sampled for any class, unseen ones in particular. We build in this way a synthetic unseen training set D̂^u = {(x̂_j^u, y_j^u, a_j^u)} of triplets in R^{d_x} × U × R^{d_a}. Combined with the real features from seen classes in D^s, this set of synthetic features for unseen categories allows the fine-tuning of the classification layer f. The new pixel-level classifier for categories in C becomes ŷ = f(x; D̂^u, D^s). It can be used to conduct the semantic segmentation of images that exhibit objects from both types of classes.
Zero-shot learning and self-training. Self-training is a useful strategy in semi-supervised learning that leverages a model's own predictions on unlabelled data to heuristically obtain additional pseudo-annotated training data [52].
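Such confidence-based pseudo-labeling can be sketched as follows. Scoring pixels by their softmax confidence and keeping a top-p fraction is an implementation assumption consistent with the description; shapes and p are toy values:

```python
import numpy as np

def select_pseudo_labels(logits, p=0.25):
    """Keep the top p fraction of pixels by softmax confidence as pseudo-labels.

    logits: (num_pixels, num_classes) predictions on unlabelled images.
    Returns (indices, labels) of the retained pixels.
    """
    e = np.exp(logits - logits.max(axis=1, keepdims=True))   # stable softmax
    probs = e / e.sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)                  # confidence of the arg-max class
    labels = probs.argmax(axis=1)
    k = max(1, int(p * len(conf)))
    idx = np.argsort(-conf)[:k]               # most confident pixels first
    return idx, labels[idx]

logits = np.random.randn(1000, 6)             # toy predictions on unlabelled pixels
idx, pseudo = select_pseudo_labels(logits, p=0.25)
assert len(idx) == 250
```

The retained (feature, pseudo-label) pairs would then be added to the training set before retraining the segmentation network.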
Assuming that unlabelled images with objects from unseen classes are now available (a relaxed setting compared to pure ZSL), such self-supervision can be mobilized to improve our zero-shot model. The trained ZS3Net (the one obtained after step 2) can indeed be used to "annotate" these additional images automatically and, for each one, the top p% most confident of these pseudo-labels provide new training features for unseen classes. The semantic segmentation network is then retrained accordingly. We coin this new model ZS5Net, for ZS3Net with Self-Supervision.

¹ Final softmax and decision layers operate in effect after logits are up-sampled back to full resolution.

Note that there exists a connection between ZS5Net and transductive zero-shot learning [26, 40, 49], but our model is not transductive. Indeed, under purely transductive settings, the unlabelled test data itself is available at training time for unseen classes. Differently, our ZS5Net learns from a mix of labelled and unlabelled training data and is evaluated on a distinct test set (effectively, a form of semi-supervised learning).

Figure 3: Graph-context encoding in the generative pipeline. The segmentation mask is encoded as an adjacency graph of semantic connected components (represented as nodes with different colors in the graph). Each semantic node is attached to its corresponding word2vec embedding vector. The generative process is conditioned on this graph. The generated output is also a graph with the same structure as the input's, except that attached to each output node is a generated visual feature. Best viewed in color.

Graph-context encoding. Understanding and utilizing contextual information is very important for semantic segmentation, especially in complex scenes. Indeed, by design, FCN-based architectures already encode context with convolutional layers.
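The mask-to-graph encoding of Figure 3 can be sketched as follows: one node per connected component of a single class, one edge per shared boundary. The use of scipy's connected-component labelling and 4-connectivity are implementation assumptions not specified in the paper:

```python
import numpy as np
from scipy import ndimage

def mask_to_graph(mask):
    """Build the adjacency graph of semantic connected components of a label mask.

    Returns (node_classes, edges): the class of each node and the set of
    (smaller id, larger id) pairs of nodes that share a boundary.
    """
    node_classes, comp_id = [], np.full(mask.shape, -1)
    for c in np.unique(mask):
        lab, n = ndimage.label(mask == c)        # connected components of class c
        for i in range(1, n + 1):
            comp_id[lab == i] = len(node_classes)
            node_classes.append(int(c))
    edges = set()
    # 4-connected neighbours with different component ids share a boundary.
    for shifted in (comp_id[1:, :], comp_id[:, 1:]):
        base = comp_id[: shifted.shape[0], : shifted.shape[1]]
        for u, v in zip(base.ravel(), shifted.ravel()):
            if u != v:
                edges.add((min(u, v), max(u, v)))
    return node_classes, edges

mask = np.array([[0, 0, 1],
                 [0, 2, 1],
                 [2, 2, 1]])
classes, edges = mask_to_graph(mask)
assert len(classes) == 3 and (0, 1) in edges
```

Each node would then be fed to the graph-convolutional generator as the concatenation of its class embedding with a Gaussian noise sample.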
For more context encoding, DeepLabv3+ applies several parallel dilated convolutions at different rates. To complement this mechanism, we propose to leverage contextual cues in our feature generation. To this end, we introduce a graph-based method to encode the semantic context of complex scenes with many objects, such as those in the Pascal-Context dataset.
In general, the structural arrangement of objects contains informative cues for recognition. For example, it is common to see a 'dog' sitting on a 'chair' but very rare to see a 'horse' doing the same thing. Such spatial priors are naturally captured by a form of relational graph, which can be constructed in different ways, e.g., using manual sketches or semantic segmentation masks of synthetic scenes. What matters most is the relative spatial arrangement, not the precise shapes of the objects. As a proof of concept, we simply exploit true segmentation masks during the training of the generative model. It is important to note that the images associated to these masks are not used during training if they contain unseen classes.
A segmentation mask is represented by the adjacency graph G = (V, E) of its semantic connected components: each node corresponds to one connected component of a single class label, and two such components are neighbors if they share a boundary (hence being of two different classes). We re-design the generator to accept G as additional input using graph convolutional layers [23]. As shown in Figure 3, each input node ν ∈ V is represented by concatenating its corresponding semantic embedding a_ν with a random Gaussian sample z_ν. This modified generator outputs features attached to the nodes of the input graph.

3 Experiments

3.1 Experimental details

Datasets. Experimental evaluation is done on two datasets: Pascal-VOC 2012 [15] and Pascal-Context [31].
Pascal-VOC contains 1,464 training images with segmentation annotations of 20 object classes. Similar to [11], we adopt additional supervision from semantic boundary annotations [19] during training. Pascal-Context provides dense semantic segmentation annotations for Pascal-VOC 2010, which comprises 4,998 training and 5,105 validation images of 59 object/stuff classes.
Zero-shot setups. We consider different zero-shot setups varying in the number of unseen classes: we randomly construct the 2-, 4-, 6-, 8- and 10-class unseen sets. We extend the unseen set in an incremental manner, meaning for instance that the 4-class unseen set contains the 2-class one. Details of the splits are as follows:

Table 1: Zero-shot semantic segmentation on Pascal-VOC.

                 |       Seen       |      Unseen      |     Overall      |
K  Model         |  PA    MA   mIoU |  PA    MA   mIoU |  PA    MA   mIoU | hIoU
-  Supervised    |   -     -     -  |   -     -     -  | 94.7  87.2  76.9 |  -
2  Baseline      | 92.1  79.8  68.1 | 11.3  10.5   3.2 | 89.7  73.4  44.1 |  6.1
2  ZS3Net        | 93.6  84.9  72.0 | 52.8  53.7  35.4 | 92.7  81.9  68.5 | 47.5
4  Baseline      | 89.9  72.6  64.3 | 10.3  10.1   2.9 | 86.3  62.1  38.9 |  5.5
4  ZS3Net        | 92.0  78.3  66.4 | 43.1  45.7  23.2 | 89.8  72.1  58.2 | 34.4
6  Baseline      | 79.5  45.1  39.8 |  8.3   8.4   2.7 | 71.1  38.4  33.4 |  5.1
6  ZS3Net        | 85.5  52.1  47.3 | 67.3  60.7  24.2 | 84.2  54.6  40.7 | 32.0
8  Baseline      | 75.8  41.3  35.7 |  6.9   5.7   2.0 | 68.3  34.7  24.3 |  3.8
8  ZS3Net        | 81.6  31.6  29.2 | 68.7  62.3  22.9 | 80.3  43.3  26.8 | 25.7
10 Baseline      | 68.7  33.9  31.7 |  6.7   5.8   1.9 | 60.1  26.9  16.9 |  3.6
10 ZS3Net        | 82.7  37.4  33.9 | 55.2  45.7  18.1 | 79.6  41.4  26.3 | 23.6

Models   | Generalized eval. | Vanilla eval.
Baseline |        1.9        |     41.7
ZS3Net   |       18.1        |     46.2

Table 2: Generalized vs. vanilla ZSL evaluation.
Results are reported with the mIoU metric on the 10-unseen split of the Pascal-VOC dataset.

Pascal-VOC: 2-cow/motorbike, 4-airplane/sofa, 6-cat/tv, 8-train/bottle, 10-chair/potted-plant;
Pascal-Context: 2-cow/motorbike, 4-sofa/cat, 6-boat/fence, 8-bird/tvmonitor, 10-keyboard/aeroplane.

Evaluation metrics. In our experiments we adopt standard semantic segmentation metrics [28], i.e. pixel accuracy (PA), mean accuracy (MA) and mean intersection-over-union (mIoU). Similar to [44], we also report the harmonic mean (hIoU) of the seen and unseen mIoUs. The reason for choosing the harmonic rather than the arithmetic mean is that seen classes often have much higher mIoUs, which would significantly dominate the overall result. As we expect ZS3 models to produce good performance for both seen and unseen classes, the harmonic mean is an interesting indicator.
A zero-shot semantic segmentation baseline. As a baseline, we adapt the ZSL classification approach of [16] to our task. To this end, we first modify the vanilla segmentation network, i.e. DeepLabv3+, to not produce class-wise probabilistic predictions for all the pixels, but instead to regress the corresponding semantic embedding vectors. Effectively, the last classification layer of DeepLabv3+ is replaced by a projection layer which transforms 256-channel feature maps into 300-channel word2vec embedding maps. The model is trained to maximize the cosine similarity between the output and target embeddings. At run-time, the label predicted at a pixel is the one whose text embedding is the most similar (in the cosine sense) to the embedding regressed for this pixel.
Implementation details. We adopt the DeepLabv3+ framework [11] built upon the ResNet-101 backbone [20].
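As an aside, the hIoU indicator defined above reduces to a one-liner; the check below plugs in the 2-unseen ZS3Net mIoUs reported in Table 1 (72.0 seen / 35.4 unseen):

```python
def hiou(miou_seen, miou_unseen):
    # Harmonic mean of seen and unseen mIoU, penalizing imbalance between the two.
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)

# ZS3Net, 2-unseen split on Pascal-VOC (Table 1): 72.0 seen / 35.4 unseen mIoU.
assert round(hiou(72.0, 35.4), 1) == 47.5
```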
Segmentation models are trained with the SGD optimizer [5] using polynomial learning-rate decay, with a base learning rate of 7e−3, weight decay 5e−4 and momentum 0.9.
The GMMN is a multi-layer perceptron with one hidden layer, leaky-ReLU non-linearity [29] and dropout [41]. In our experiments, we fix the number of hidden neurons to 256 and set the kernel bandwidths to {2, 5, 10, 20, 40, 60}. These hyper-parameters are chosen with the "zero-shot cross-validation procedure" of [7]. The input Gaussian noise has the same dimension as the word2vec embeddings used, namely 300. The generative model is trained using the Adam optimizer [21] with a learning rate of 2e−4. To encode the graph context as described in Section 2.2, we replace the linear layers in the GMMN by graph convolutional layers [23], with no change in the other hyper-parameters.

3.2 Zero-shot semantic segmentation

We report in Tables 1 and 3 results on the Pascal-VOC and Pascal-Context datasets, according to the three metrics. Instead of only evaluating on the unseen set (which does not show the strong prediction bias toward seen classes), we jointly evaluate on all classes and report results for seen, unseen, and all classes.
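Concretely, this joint evaluation changes only which logits enter the arg-max at each pixel; a sketch with hypothetical class index sets contrasts it with the vanilla protocol, where seen scores are discarded:

```python
import numpy as np

SEEN, UNSEEN = [0, 1, 2], [3, 4]               # hypothetical class index sets

def predict_generalized(logits):
    # Generalized ZSL: arg-max over all classes, seen and unseen compete.
    return int(np.argmax(logits))

def predict_vanilla(logits):
    # Vanilla ZSL: seen scores are ignored, arg-max restricted to unseen classes.
    return UNSEEN[int(np.argmax(logits[UNSEEN]))]

logits = np.array([2.0, 0.1, 0.3, 1.5, 0.2])   # pixel scores biased toward seen class 0
assert predict_generalized(logits) == 0        # the bias toward seen classes shows up
assert predict_vanilla(logits) == 3            # and is hidden under the vanilla protocol
```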
Such an evaluation protocol is more challenging for zero-shot settings and is known as "generalized ZSL evaluation" [9]. In both Tables 1 and 3, we report in the first line the 'oracle' performance of the model trained with full supervision on the complete dataset (including both seen and unseen classes).

Table 3: Zero-shot semantic segmentation on Pascal-Context.

                 |       Seen       |      Unseen      |     Overall      |
K  Model         |  PA    MA   mIoU |  PA    MA   mIoU |  PA    MA   mIoU | hIoU
-  Supervised    |   -     -     -  |   -     -     -  | 73.9  52.4  42.2 |  -
2  Baseline      | 70.2  47.7  35.8 |  9.5  10.2   2.7 | 66.2  43.9  33.1 |  5.0
2  ZS3Net        | 71.6  52.4  41.6 | 49.3  46.2  21.6 | 71.2  52.2  41.0 | 28.4
2  ZS3Net + GC   | 73.0  52.9  41.5 | 65.8  62.2  30.0 | 72.6  53.1  41.3 | 34.8
4  Baseline      | 66.2  37.9  33.4 |  9.0   8.4   2.5 | 62.8  34.6  30.7 |  4.7
4  ZS3Net        | 68.4  46.1  37.2 | 58.4  53.3  24.9 | 67.8  46.6  36.4 | 29.8
4  ZS3Net + GC   | 70.3  49.1  39.5 | 61.0  56.3  29.1 | 69.0  49.7  38.6 | 33.5
6  Baseline      | 60.8  36.7  31.9 |  8.8   8.0   2.1 | 55.9  33.5  28.8 |  3.9
6  ZS3Net        | 63.3  38.0  32.1 | 63.6  55.8  20.7 | 63.3  39.8  30.9 | 25.2
6  ZS3Net + GC   | 64.5  42.7  34.8 | 57.2  53.3  21.6 | 64.2  43.7  33.5 | 26.7
8  Baseline      | 54.1  24.7  22.0 |  7.3   6.8   1.7 | 49.1  20.9  19.2 |  3.2
8  ZS3Net        | 51.4  23.9  20.9 | 68.2  59.9  16.0 | 53.1  28.7  20.3 | 18.1
8  ZS3Net + GC   | 53.0  27.1  22.8 | 68.5  61.1  16.8 | 54.6  31.4  22.0 | 19.3
10 Baseline      | 50.0  20.8  17.5 |  5.7   5.0   1.3 | 45.1  16.8  14.3 |  2.4
10 ZS3Net        | 53.5  23.8  20.8 | 58.6  43.2  12.7 | 52.8  27.0  19.4 | 15.8
10 ZS3Net + GC   | 50.3  27.9  24.0 | 62.6  47.8  14.1 | 51.2  31.0  22.3 | 17.8

Pascal-VOC. Table 1 reports segmentation performance on 5 different splits, comparing the ZSL baseline and our approach ZS3Net. We observe that the embedding-based ZSL baseline (DeViSe [16]), while performing nicely on seen classes, produces much worse results for the unseen ones. We conducted additional experiments, adapting ALE [1] with K = 2.
They yielded 68.1% and 4.6% mIoU for seen and unseen classes respectively (harmonic mean of 8.6%), on a par with the DeViSe-based baseline.
Not as strongly harmed by the bias toward seen classes, the proposed ZS3Net provides significant gains (in PA, MA and, most importantly, mIoU) on the unseen classes, e.g. +32.2% mIoU on the 2-split, with similarly large gaps on the other splits. As for seen classes, ZS3Net performs comparably to the ZSL baseline, with slight improvements on some splits. Overall, we obtain favourable results on all the classes using our generative approach. The last column of Table 1 shows the harmonic mean of the seen and unseen mIoUs, denoted hIoU (%). As discussed in 3.1, the harmonic mean is a good indicator of effective zero-shot performance. Again according to this metric, ZS3Net outperforms the baseline by significant margins.
As mentioned in 2.2, our framework is agnostic to the choice of the generative model. We experimented with a variant of ZS3Net based on a GAN [45], which turned out to be on a par with the reported GMMN-based one. In our experiments, GMMN was chosen due to its better stability.
In all experiments, we only report results with the generalized ZSL evaluation. Table 2 shows the difference between this evaluation protocol and the common vanilla one. In the vanilla case, prediction scores of seen classes are completely ignored; only unseen scores are used to classify the unseen objects. As a result, this evaluation protocol does not show how well the models discriminate between seen and unseen pixels. In zero-shot settings, predictions are mostly biased toward seen classes given the strong supervision during training. To clearly reflect such a bias, the generalized ZSL evaluation is the better choice. We see that the ZSL baseline achieves reasonably good results using the vanilla ZSL evaluation while showing much worse figures with the generalized one.
With ZS3Net, leveraging both real features from seen classes and synthetic ones from unseen classes to train the classifier helps reduce the performance bias toward the seen set.
The first two examples in Fig. 4 illustrate the merit of our approach. A model trained only on seen classes ('w/o ZSL') interprets unseen objects as background or as one of the seen classes. For example, 'cat' and 'plane' (unseen) are detected as 'dog' and 'boat' (seen); a large part of the 'plane' is considered as 'background'. The proposed ZS3Net correctly recognizes these unseen objects.
Pascal-Context. In Table 3, we provide results on the Pascal-Context dataset, a more challenging benchmark than Pascal-VOC. Indeed, Pascal-Context scene pixels are densely annotated with

Figure 4: Qualitative results on Pascal-VOC and Pascal-Context. (a) Input image, (b) semantic segmentation ground-truth, (c) segmentation without zero-shot learning, (d) results with the proposed ZS3Net. Unseen classes: plane, cat, cow, boat; some seen classes: dog, bird, horse, dining-table. Best viewed in color.

Table 4: Zero-shot with self-training.
ZS5Net results on the Pascal-VOC and Pascal-Context datasets.

           |       Seen       |      Unseen      |     Overall      |
K  Dataset |  PA    MA   mIoU |  PA    MA   mIoU |  PA    MA   mIoU | hIoU
2  VOC     | 94.3  85.2  75.7 | 89.5  89.9  75.8 | 94.2  85.9  75.8 | 75.7
2  Context | 72.9  53.6  41.8 | 81.0  78.1  55.5 | 71.8  50.6  42.0 | 47.7
4  VOC     | 93.9  84.8  74.0 | 57.5  62.9  53.0 | 92.4  80.9  69.8 | 61.8
4  Context | 66.0  46.3  37.5 | 86.4  82.8  45.1 | 68.0  48.5  38.0 | 41.0
6  VOC     | 93.8  82.0  71.2 | 68.3  61.2  53.1 | 92.1  75.8  66.1 | 60.8
6  Context | 59.7  42.2  34.6 | 80.9  76.8  36.0 | 62.1  45.8  35.2 | 35.2
8  VOC     | 92.6  77.8  68.3 | 68.2  62.0  50.0 | 90.2  71.9  61.3 | 57.7
8  Context | 51.8  34.1  28.5 | 76.2  71.3  24.1 | 54.3  39.5  27.8 | 26.1
10 VOC     | 90.1  83.9  72.3 | 57.8  48.0  34.5 | 86.8  66.9  54.4 | 46.7
10 Context | 46.8  32.3  27.0 | 70.2  57.1  20.7 | 49.5  36.4  26.0 | 23.4

59 object/stuff classes, compared to only a few annotated objects (most of the time 1-2 objects) per scene in Pascal-VOC. As a result, segmentation models universally report lower performance on Pascal-Context. Regardless of this difference, we observe similar behaviors from the ZSL baseline and our method. ZS3Net outperforms the baseline by significant margins on all evaluation metrics. We emphasize the important improvements on the seen classes, as well as on the overall harmonic mean of the seen and unseen mIoUs. We visualize in the two last rows of Figure 4 qualitative ZS3 results on Pascal-Context. Unseen objects, i.e. 'cow' and 'boat', while being wrongly classified as some seen classes without ZSL, are fully recognized by our zero-shot framework.
As mentioned above, Pascal-Context scenes are more complex, with much denser object annotations. We argue that, in this case, context priors on object arrangement convey beneficial cues to improve recognition performance. We introduced a novel graph-context mechanism to encode such priors in Section 2.2.
The proposed models enriched with this graph context, denoted as 'ZS3Net + GC' in Table 3, show consistent improvements over the plain ZS3Net models. We note that for semantic segmentation, the pixel accuracy metric (PA) is biased toward the more dominant classes and might suggest misleading conclusions [12], as opposed to IoU metrics.

Figure 5: Influence of parameter p on ZS5Net. Evolution of the mIoU performance as a function of the percentage of high-scoring unseen pixels, on the 2-unseen-classes split of the Pascal-VOC and Pascal-Context datasets.

Figure 6: Zero-shot segmentation with self-training. (a) Input image, (b) semantic segmentation ground-truth, (c) segmentation with ZS3Net, (d) result with additional self-training (ZS5Net). Unseen classes: motorbike, sofa; some seen classes: car, chair. Best viewed in color.

3.3 Zero-shot segmentation with self-training

We report the performance of ZS5Net (ZS3Net with self-training) on Pascal-VOC and Pascal-Context in Table 4 (with different splits according to the dataset). For performance comparison, the reader is referred to the ZS3Net results in Tables 1 and 3. Through zero-shot cross-validation, we fixed the percentage of high-scoring unseen pixels at p = 25% for Pascal-VOC and p = 75% for Pascal-Context. We show in Figure 5 the influence of this percentage on the final performance. In general, the additional self-training step strongly boosts the performance on seen, unseen and all classes. Remarkably, on the 2-unseen split of both datasets, the overall performance in all metrics is very close to the supervised performance (reported in the first lines of Tables 1 and 3). Figure 6 shows semantic segmentation results on the Pascal-VOC and Pascal-Context datasets. 
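The selection of high-scoring unseen pixels for pseudo-labeling can be sketched as follows. This is our minimal illustration, assuming a single global confidence threshold over all unseen-predicted pixels; the helper name `select_pseudo_labels` is hypothetical, and the paper may apply a different (e.g. per-class) rule.

```python
import numpy as np

def select_pseudo_labels(probs, unseen_ids, p):
    """Keep the top-p fraction of most confident pixels predicted as unseen.

    probs:      (H, W, C) class scores from the zero-shot classifier.
    unseen_ids: indices of the unseen classes.
    p:          fraction of unseen-predicted pixels to keep (e.g. 0.25).
    Returns an (H, W) pseudo-label map, with -1 marking ignored pixels.
    """
    pred = probs.argmax(axis=-1)              # per-pixel predicted class
    conf = probs.max(axis=-1)                 # per-pixel confidence
    labels = np.full(pred.shape, -1, dtype=int)
    mask = np.isin(pred, unseen_ids)          # pixels assigned to unseen classes
    if mask.any():
        # Global threshold keeping the top-p fraction (assumption, see above).
        thresh = np.quantile(conf[mask], 1.0 - p)
        keep = mask & (conf >= thresh)
        labels[keep] = pred[keep]
    return labels

# Toy 2x2 image with 3 classes; class 2 is unseen.
probs = np.array([[[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]],
                  [[0.2, 0.1, 0.7],  [0.3, 0.3, 0.4]]])
print(select_pseudo_labels(probs, [2], 0.5))  # [[-1  2] [ 2 -1]]
```

Pixels left at -1 (seen-predicted or low-confidence) would simply be excluded from the self-training loss.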
In both cases, self-training helps to disambiguate pixels wrongly classified as seen classes.

4 Conclusion

In this work, we introduced a deep model for the task of zero-shot semantic segmentation. Building on zero-shot classification, our ZS3Net model combines rich text and image embeddings, generative modeling and classic classifiers to learn how to segment objects from already-seen classes as well as from new, never-seen ones at test time. First of its kind, the proposed ZS3Net shows good behavior on the task of zero-shot semantic segmentation, setting competitive baselines on various benchmarks. We also introduced a self-training extension of the approach for scenarios where unlabelled pixels from unseen classes are available at training time. Finally, a graph-context encoding has been used to improve the semantic class representations of ZS3Net when facing complex scenes.

References

[1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. TPAMI, 2015.

[2] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.

[3] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.

[4] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. In ECCV, 2018.

[5] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, 2010.

[6] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Improving semantic embedding consistency by metric learning for zero-shot classification. In ECCV, 2016.

[7] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie. Generating visual representations for zero-shot classification. 
In ICCV, 2017.

[8] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.

[9] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.

[10] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2017.

[11] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.

[12] Gabriela Csurka, Diane Larlus, and Florent Perronnin. What is a good evaluation measure for semantic segmentation? In BMVC, 2013.

[13] Jifeng Dai, Kaiming He, and Jian Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In CVPR, 2015.

[14] Berkan Demirel, Ramazan Gokberk Cinbis, and Nazli Ikizler-Cinbis. Zero-shot object detection by hybrid region embedding. In BMVC, 2018.

[15] Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes challenge: A retrospective. IJCV, 2015.

[16] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.

[17] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. 
In NIPS, 2014.

[19] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[22] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[23] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[24] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.

[25] Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. Generalized zero-shot learning via synthesized examples. In CVPR, 2018.

[26] Yanan Li, Donghui Wang, Huanhang Hu, Yuetan Lin, and Yueting Zhuang. Zero-shot recognition using dual visual-semantic mapping paths. In CVPR, 2017.

[27] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, 2015.

[28] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[29] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.

[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[31] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.

[32] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S Corrado, and Jeffrey Dean. 
Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.

[33] George Papandreou, Liang-Chieh Chen, Kevin P Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, 2015.

[34] Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.

[35] Shafin Rahman, Salman Khan, and Fatih Porikli. Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts. In ACCV, 2018.

[36] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.

[37] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.

[39] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.

[40] Jie Song, Chengchao Shen, Yezhou Yang, Yang Liu, and Mingli Song. Transductive unbiased embedding for zero-shot learning. In CVPR, 2018.

[41] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 2014.

[42] Vinay Kumar Verma and Piyush Rai. A simple exponential family framework for zero-shot learning. In ECML PKDD, 2017.

[43] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.

[44] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. 
Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. TPAMI, 2018.

[45] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, 2018.

[46] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.

[47] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.

[48] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.

[49] Ziming Zhang and Venkatesh Saligrama. Zero-shot recognition via structured prediction. In ECCV, 2016.

[50] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.

[51] Pengkai Zhu, Hanxiao Wang, and Venkatesh Saligrama. Zero shot detection. TCSVT, 2019.

[52] Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2005.