{"title": "Category Anchor-Guided Unsupervised Domain Adaptation for Semantic Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 435, "page_last": 445, "abstract": "Unsupervised domain adaptation (UDA) aims to enhance the generalization capability of a model from a source domain to a target domain. UDA is of particular significance since no extra effort is devoted to annotating target domain samples. However, the different data distributions in the two domains, or \\emph{domain shift/discrepancy}, inevitably compromise the UDA performance. Although there has been progress in matching the marginal distributions between the two domains, the classifier favors the source domain features and makes incorrect predictions on the target domain due to category-agnostic feature alignment. In this paper, we propose a novel category anchor-guided (CAG) UDA model for semantic segmentation, which explicitly enforces category-aware feature alignment to learn shared discriminative features and classifiers simultaneously. First, the category-wise centroids of the source domain features are used as guided anchors to identify the active features in the target domain and also assign them pseudo-labels. Then, we leverage an anchor-based pixel-level distance loss and a discriminative loss to drive the intra-category features closer and the inter-category features further apart, respectively. Finally, we devise a stagewise training mechanism to reduce the error accumulation and adapt the proposed model progressively. Experiments on both the GTA5$\\rightarrow $Cityscapes and SYNTHIA$\\rightarrow $Cityscapes scenarios demonstrate the superiority of our CAG-UDA model over the state-of-the-art methods. 
The code is available at \\url{https://github.com/RogerZhangzz/CAG\\_UDA}.", "full_text": "Category Anchor-Guided Unsupervised Domain Adaptation for Semantic Segmentation

Qiming Zhang∗1, Jing Zhang∗1, Wei Liu2, Dacheng Tao1

1UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia
2Tencent AI Lab, China

qzha2506@uni.sydney.edu.au, jing.zhang1@sydney.edu.au, wl2223@columbia.edu, dacheng.tao@sydney.edu.au

Abstract

Unsupervised domain adaptation (UDA) aims to enhance the generalization capability of a model from a source domain to a target domain. UDA is of particular significance since no extra effort is devoted to annotating target domain samples. However, the different data distributions in the two domains, or domain shift/discrepancy, inevitably compromise the UDA performance. Although there has been progress in matching the marginal distributions between the two domains, the classifier favors the source domain features and makes incorrect predictions on the target domain due to category-agnostic feature alignment. In this paper, we propose a novel category anchor-guided (CAG) UDA model for semantic segmentation, which explicitly enforces category-aware feature alignment to learn shared discriminative features and classifiers simultaneously. First, the category-wise centroids of the source domain features are used as guided anchors to identify the active features in the target domain and also assign them pseudo-labels. Then, we leverage an anchor-based pixel-level distance loss and a discriminative loss to drive the intra-category features closer and the inter-category features further apart, respectively. Finally, we devise a stagewise training mechanism to reduce the error accumulation and adapt the proposed model progressively. 
Experiments on both the GTA5→Cityscapes and SYNTHIA→Cityscapes scenarios demonstrate the superiority of our CAG-UDA model over the state-of-the-art methods. The code is available at https://github.com/RogerZhangzz/CAG_UDA.

∗ indicates equal contributions.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Semantic segmentation is a classical computer vision task that refers to assigning pixel-wise category labels to a given image to facilitate downstream applications such as autonomous driving, video surveillance, and image editing. The recent progress in semantic segmentation has been dominated by deep neural networks trained on large datasets. Despite their success, annotating labels at the pixel level is prohibitively expensive and time-consuming, e.g., about 90 minutes for a single image in the Cityscapes dataset [8]. One economical alternative is to exploit computer graphics techniques to simulate a virtual 3D environment and automatically generate images and labels, e.g., GTA5 [31] and SYNTHIA [32]. Although synthetic images have similar appearances to real images, there still exist subtle differences in textures, layouts, colors, and illumination conditions [11, 42–44], which result in different data distributions, or domain discrepancy. Consequently, the performance of a certain model trained on synthetic datasets degrades drastically when applied to realistic scenes. To address this issue, one promising approach is domain adaptation [1, 45, 15, 34, 36, 27, 33, 40, 47, 13] to reduce the domain shift and learn a shared discriminative model for both domains. 
In this paper, we\ntackle the more challenging unsupervised domain adaptation (UDA) situation, where no labels are\navailable in the target domain during training.\nPrevious methods have tried to learn domain-invariant representations by matching the distributions\nbetween source and target domains at the appearance level [27, 34, 40, 13, 21], feature level [14, 27,\n3, 13], or output level [45, 36, 26]. However, even though matching the global marginal distributions\ncan bring the two domains closer, e.g., reaching a lower maximum mean discrepancy (MMD) [25]\nor a saddle point in the minimax game via adversarial learning [13], it does not guarantee that\nsamples from different categories in the target domain are properly separated, hence compromising\nthe generalization ability. To tackle this issue, one could instead consider category-aware feature\nalignment by matching the local joint distributions of features and categories [7, 19, 33]. Other\napproaches adopt the idea of self-training by generating pseudo-labels for samples in the target\ndomain and providing extra supervision to the classi\ufb01er [47, 21, 3]. Together with supervision from\nthe source domain, this enforces the network to simultaneously learn domain-invariant discriminative\nfeature representations and shared decision boundaries through back-propagation. The ideas of\nminimizing the entropy (uncertainty) of the output [39] or discrepancies between the outputs of two\nclassi\ufb01ers (voters) [26] have also been exploited to implicitly enforce category-level alignment.\nAlthough category-level alignment and self-training methods have produced some promising results,\nthere are still some outstanding issues that need to be addressed to further improve the adaptation\nperformance. For example, error-prone pseudo-labels will mislead the classi\ufb01er and accumulate\nerrors. Meanwhile, implicit category-level alignment may be affected by category imbalance. 
To deal\nwith these issues and take advantage of both approaches, here we propose a novel idea of category\nanchors, which facilitate both category-wise feature alignment and self-training. It is motivated by\nthe observation that features from the same category tend to be clustered together. Moreover, the\ncentroids of source domain features in each category can serve as explicit anchors to guide adaptation.\nSpeci\ufb01cally, we propose a novel category anchor-guided unsupervised domain adaptation model\n(CAG-UDA) for semantic segmentation. This model explicitly enforces category-wise feature\nalignment to learn shared feature representations and classi\ufb01ers for both domains simultaneously.\nFirst, the centroids of category-wise features in the source domain are used as anchors to identify\nthe active features in the target domain. Then, we assign pseudo-labels to these active features\naccording to the category of the closest anchor. Lastly, two loss functions are proposed: the \ufb01rst is a\npixel-level distance loss between the guiding anchors and active features, which pushes them closer\nand explicitly minimizes the intra-category feature variance; the other is a pixel-level discriminative\nloss to supervise the classi\ufb01er and maximize the inter-category feature variance. To reduce the error\naccumulation of incorrect pseudo-labels, we propose a stagewise training mechanism to adapt the\nmodel progressively.\nThe main contributions of this paper can be summarized as follows. First, we propose a novel category\nanchor idea to tackle the challenging UDA problem in semantic segmentation. Second, we propose a\nsimple yet effective category anchor-based method to identify active features in the target domain,\nfurther enabling category-wise feature alignment. 
Finally, the proposed CAG-UDA model achieves new state-of-the-art performance in both GTA5→Cityscapes and SYNTHIA→Cityscapes scenarios.

2 Related Work

Many recent advances in computer vision [20, 12, 30, 11, 24, 46, 5] have been based on deep neural networks trained on large-scale labeled datasets such as ImageNet [9], Pascal VOC [10], MS COCO [22], and Cityscapes [8]. However, a domain shift between training data and testing data impairs model performance [29, 17, 18]. To overcome this issue, a variety of domain adaptation methods for classification [6, 23, 37, 28, 41, 3, 19], detection [38, 16], and segmentation [7, 14, 13, 27, 34, 40, 21, 47] have been proposed. In this paper, we focus on the challenging semantic segmentation problem. The current mainstream approaches include style transfer [27, 34, 40, 13, 21], feature alignment [7, 14, 13], and self-training [47, 21]. As our work is most related to the latter two approaches, we briefly review and discuss their characteristics.

Feature distribution alignment: Previous methods that match the global marginal distributions between two domains [14, 13, 27] do not distinguish local category-wise feature distribution shifts. Consequently, error-prone predictions are made for misaligned features with shared decision boundaries. In contrast to these methods, we propose a category-wise feature alignment method to explicitly reduce category-level mismatches and learn discriminative domain-invariant features. The idea of category-level feature alignment was also exploited in [26, 33] for semantic segmentation. Luo et al. proposed a weighted adversarial learning method to align the category-level feature distributions implicitly [26]. Saito et al. tried to align the feature distributions and learn discriminative domain-invariant features by utilizing task-specific classifiers as a discriminator [33]. 
In contrast to the implicit feature alignment in the aforementioned methods, we propose a novel category anchor-guided method, which directly aligns category-wise features in both domains.

Pseudo-label assignment: Assigning pseudo-labels to target domain samples based on the trained classifiers helps adapt the feature extractor and classifier to the target domain. Zou et al. [47] proposed an iterative self-training UDA model that alternately generates pseudo-labels and retrains the model. They also dealt with the category imbalance issue by controlling the proportion of selected pseudo-labels in each category [47]. Li et al. [21] proposed a bidirectional learning domain adaptation model that alternately trains the image translation model and the self-supervised segmentation adaptation model. In contrast to these methods, where pseudo-labels were determined according to the predicted category probability, we propose a category anchor-based method to generate trustworthy pseudo-labels. Compared with selected samples that have been “correctly” classified with high confidence, our selected samples are not determined by the decision boundaries and are thus more informative for the classifier to further adapt to the target domain.

The idea of assigning pseudo-labels based on category centers has also been utilized in domain adaptation for classification, e.g., category centroids in [41], prototypes in [3], and cluster centers in [19]. The former two methods minimize the distance loss against category centroids, while the third minimizes contrastive domain discrepancies. Our method differs from these methods in several ways. First, we tackle the more challenging task of image semantic segmentation rather than image classification, where dense pixel-wise labels need to be predicted rather than a single label for the entire image. 
Second, we fix the category centroids (hence called category anchors) instead of updating them at each iteration. On one hand, the mini-batch size used for segmentation (e.g., 1 in this paper) is much smaller than that used for classification. On the other hand, pixels are spatially coherent in an image, so the category centroids calculated at each iteration would be biased and unreliable due to the dominance of homogeneous features. Third, the pseudo-labels of target domain samples are determined by their distances to the category centroids from the source domain instead of the target domain. This is reasonable since: 1) the source domain category centroids are calculated from all training samples based on ground-truth labels, which are reliable; 2) driving the target domain features towards the source domain category centroids can effectively reduce the domain discrepancy. Fourth, together with the category anchor-based distance loss, we also add a segmentation loss based on the pseudo-labeled target samples to learn discriminative feature representations and adapt the decision boundaries simultaneously.

3 A category anchor-guided UDA model for semantic segmentation

3.1 Problem Formulation

Supervised semantic segmentation: A semantic segmentation model M can be formulated as a mapping function from the image domain X to the output label domain Y:

M : X \rightarrow Y,  (1)

which predicts a pixel-wise category label \hat{y} close to the ground-truth annotation y \in Y for a given image x \in X. Usually, the segmentation model M is trained in a supervised manner by minimizing the difference between the prediction \hat{y} and its ground-truth y for every training sample x. The cross-entropy (CE) loss is widely used as a measurement, which is defined as:

L_{CE} = -\sum_{i=1}^{N} \sum_{j=1}^{H \times W} \sum_{c=1}^{C} y_{ijc} \log(p_{ijc}),  (2)

where N is the number of training images, H and W denote the image size, j is the pixel index, C is the number of categories, c is the category index, y_{ijc} \in \{0, 1\} is the one-hot vector representation of the ground-truth label, i.e., \forall i, j, \sum_c y_{ijc} = 1, and p_{ijc} is the predicted category probability by M.

Figure 1: An illustration of the proposed category anchor-guided UDA model for semantic segmentation. (a) The architecture of the proposed CAG-UDA model consists of an encoder, a feature transformer (fD), and a classifier. The green part denotes the source domain flow while the orange parts represent the target domain flow. (b) The illustration of the process of active target sample identification and pseudo-label assignment described in Section 3.2. (c) The illustration of the proposed category-wise feature alignment with the anchor-based pixel-level distance loss L_{dis} and cross-entropy loss L_{CE} described in Section 3.3. Best viewed in color.

UDA for semantic segmentation: Generally, a segmentation model trained on a source domain X_s has a limited generalization capability to a target domain X_t when the distributions of X_s and X_t are different, i.e., there is a domain shift/discrepancy. Several unsupervised domain adaptation models have been proposed, which can be formulated as the following mapping function:

M_{uda} : X_s \cup X_t \rightarrow Y_s \cup Y_t,  (3)

where M_{uda} is trained on the labeled training samples (X_s, Y_s) in the source domain together with the unlabeled training samples X_t in the target domain. 
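The pixel-wise CE loss of Eq. (2) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' PyTorch implementation; the flattened (N, H·W, C) tensor shapes and the `eps` guard are assumptions made for clarity.

```python
import numpy as np

def cross_entropy_loss(y_onehot, p):
    """Pixel-wise cross-entropy of Eq. (2), summed over images, pixels,
    and categories. Shapes are (N, H*W, C); eps guards against log(0)."""
    eps = 1e-12
    return float(-np.sum(y_onehot * np.log(p + eps)))

# Toy check: one image, two pixels, two categories.
y = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # one-hot ground-truth labels
p = np.array([[[0.8, 0.2], [0.3, 0.7]]])   # predicted probabilities
loss = cross_entropy_loss(y, p)            # equals -(log 0.8 + log 0.7)
```

In practice only the probability of the true category contributes to each pixel's term, which is why frameworks expose this loss through integer labels rather than one-hot tensors.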
Typically, the aforementioned CE loss and some domain-adaptation losses are used to align the distributions of the two domains (e.g., p(X_s) and p(X_t)) and to learn domain-invariant discriminative feature representations.

Model components: The main semantic segmentation approaches have been based on fully convolutional neural networks (CNNs) since the seminal work in [24]. Usually, a DCNN-based model has two parts: an encoder Enc and a decoder Dec, where the encoder maps the input image into a low-dimensional feature space and then the decoder decodes it to the label space. The decoder can be further divided into a feature transformation net fD and a classifier Cls, where Cls denotes the last classification layer and fD denotes the remaining part of Dec. Typical encoders are the classification networks pretrained on ImageNet [9], e.g., VGGNet [35] and ResNet [12]. The decoder consists of convolutional layers responsible for context modeling, multi-scale feature fusion, etc. UDA methods typically employ a segmentation model with carefully designed modules for domain adaptation.

3.2 Network Architecture

The network architecture of our proposed CAG-UDA model is shown in Figure 1(a). The CAG-UDA model employs Deeplab v2 [4] as the base segmentation model, where ResNet-101 is used as the encoder Enc and the ASPP module is used in the decoder Dec. To reduce the domain shift, we devise a category anchor-guided alignment module on the features from fD, consisting of category anchor construction (CAC), active target sample identification (ATI), and pseudo-label assignment (PLA), as shown in Figure 1(b). The details are as follows.

Category anchor construction (CAC): Based on the observation that pixels in the same category cluster in the feature space, we propose to calculate the centroids of the features of each category in the source domain as a representative of the feature distribution, i.e., the mean. 
Considering that the features fed into the classifier directly relate to the decision boundaries, we choose the features from fD to calculate these centroids. Mathematically, this can be written as:

f_c^s = \frac{1}{|\Lambda_c^s|} \sum_{i=1}^{N} \sum_{j}^{H \times W} y_{ijc}^s \left( f_D(Enc(x_i^s))|_j \right),  (4)

where \Lambda_c^s is the index set of all pixels on the training images in the source domain X_s belonging to the cth category, i.e., \Lambda_c^s = \{(i, j) \,|\, y_{ijc}^s = 1\}, |\Lambda_c^s| denotes the number of pixels in \Lambda_c^s, i.e., |\Lambda_c^s| = \sum_{i=1}^{N} \sum_{j}^{H \times W} y_{ijc}^s, and f_D(x_i^s)|_j is the feature vector at index j on the feature map f_D(x_i^s). It is noteworthy that we calculate the category centroids at the beginning of each training stage and then keep them fixed during training (we propose a stagewise training mechanism in Section 3.4). 
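The centroid computation of Eq. (4) amounts to a masked mean per category. The sketch below is a NumPy illustration under assumed flattened shapes (P pixels pooled over all source images, D feature channels); it is not the authors' implementation.

```python
import numpy as np

def category_anchors(features, labels, num_categories):
    """Category anchor construction (Eq. 4): the per-category mean of
    source-domain features, pooled over all images and pixels.
    features: (P, D) pixel features; labels: (P,) integer categories.
    Rows for categories absent from labels stay zero."""
    anchors = np.zeros((num_categories, features.shape[1]))
    for c in range(num_categories):
        mask = labels == c
        if mask.any():
            anchors[c] = features[mask].mean(axis=0)  # centroid f_c^s
    return anchors

# Toy example: three pixels, two of category 0 and one of category 1.
feats = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
labs = np.array([0, 0, 1])
cas = category_anchors(feats, labs, num_categories=2)
```

Because the anchors are computed once per stage over the whole source set (rather than per mini-batch), the spatial coherence of pixels within a single image cannot bias them, which is the point made in the preceding paragraph.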
Therefore, we call these centroids category anchors (CAs) in this paper, i.e., CA = \{f_c^s, c = 1, ..., C\}.

Active target sample identification (ATI): To align the category-wise feature distributions between the two domains, we expect the category centroids from the target domain to move closer to the category anchors during training. However, on one hand, target sample labels are unavailable. On the other hand, the centroids calculated on target samples are very unstable at each iteration, since the mini-batch size is very small (i.e., 1) in this paper and image pixels are spatially coherent. To tackle these issues, we propose identifying active target samples and assigning them pseudo-labels for the subsequent feature alignment. The term “active target samples” refers to target samples near one category anchor and far from the other anchors, i.e., being activated by one specific category anchor. Mathematically, this can be formulated as follows. We first define the distance between a target feature f_D(Enc(x_i^t))|_j and the cth category anchor as:

d_{ijc}^t = \| f_c^s - f_D(Enc(x_i^t))|_j \|_2,  (5)

where \|\cdot\|_2 is the L2 norm of a vector. Then, we sort \{d_{ijc}^t, c = 1, ..., C\} in an ascending order and compare the shortest distance d_{ijc^*}^t with the second shortest d_{ijc'}^t. If their difference is larger than a predefined threshold \Delta d, we identify this target sample as an active one, i.e.,

a_{ij}^t = 1 if d_{ijc'}^t - d_{ijc^*}^t > \Delta d, and a_{ij}^t = 0 otherwise,  (6)

where a_{ij}^t denotes the active state of the target feature f_D(Enc(x_i^t))|_j. Like the category anchors, we calculate the active states at the beginning of each training stage and keep them fixed during subsequent stages. 
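The margin test of Eqs. (5)-(6) can be sketched as follows. This is a NumPy illustration under assumed flattened shapes; the function name and the toy anchors are illustrative, and Δd = 2.5 is the value reported later in Section 4.1.

```python
import numpy as np

def identify_active(target_features, anchors, margin):
    """Active target sample identification (Eqs. 5-6): a pixel is active
    (a_ij = 1) when the gap between its second-shortest and shortest
    L2 distance to the category anchors exceeds `margin` (Delta d)."""
    # d_ijc of Eq. (5): (P, C) distances from each pixel to each anchor
    d = np.linalg.norm(target_features[:, None, :] - anchors[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)
    gap = d_sorted[:, 1] - d_sorted[:, 0]      # second shortest minus shortest
    return (gap > margin).astype(int)          # Eq. (6)

anchors = np.array([[0.0, 0.0], [10.0, 0.0]])  # two toy category anchors
tgt = np.array([[1.0, 0.0], [5.0, 0.0]])       # near anchor 0 / ambiguous
active = identify_active(tgt, anchors, margin=2.5)
```

A pixel equidistant from two anchors fails the margin test and is left out of the alignment, which is exactly the intended filtering of unreliable samples.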
This is explained in Section 3.4, where we introduce a stagewise training mechanism.

Pseudo-label assignment (PLA): After we obtain the active state according to Eq. (6), a pseudo-label c^* can be assigned to pixel j of x_i^t with a reliable margin \Delta d:

\hat{y}_{ijc^*}^t = 1, \text{ if } d_{ijc^*}^t < d_{ijc}^t - \Delta d, \forall c \neq c^*.  (7)

Due to the lack of target domain labels, the classifier layer is biased to the source domain and does not generalize well to the target domain, as shown in Figure 1(c). Consequently, some of the pseudo-labels from predicted probabilities may be error-prone. However, based on the observation of the intra-category clustering characteristics, the pseudo-labels generated via category anchors are independent of the biased classifier and are thus more reliable than those assigned by predicted category probabilities. Further, considering that high-probability samples have been “correctly” classified by the classifier layer with high confidence, these samples provide only weak supervision signals. In contrast, active samples are more informative for adapting the classifier to the target domain, as the classifier layer may not predict these active samples with high probabilities.

3.3 Objective Functions

When training the CAG-UDA model, we leverage a CE loss L_{CE}^s as defined in Eq. (2). 
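The pseudo-labels of Eq. (7), consumed by the objective functions below, follow directly from the same anchor distances: an active pixel takes the category of its nearest anchor. A minimal NumPy sketch (illustrative shapes and toy values, not the authors' code):

```python
import numpy as np

def assign_pseudo_labels(target_features, anchors, margin):
    """Pseudo-label assignment (Eq. 7): an active pixel takes the category
    of its nearest anchor; pixels failing the margin test get -1 (ignored)."""
    d = np.linalg.norm(target_features[:, None, :] - anchors[None, :, :], axis=2)
    nearest = d.argmin(axis=1)                          # candidate category c*
    d_sorted = np.sort(d, axis=1)
    active = d_sorted[:, 1] - d_sorted[:, 0] > margin   # reliable margin test
    return np.where(active, nearest, -1)

anchors = np.array([[0.0, 0.0], [10.0, 0.0]])
tgt = np.array([[1.0, 0.0], [9.0, 0.0], [5.0, 0.0]])
pseudo = assign_pseudo_labels(tgt, anchors, margin=2.5)
```

The `-1` sentinel mirrors the common "ignore index" convention: inactive pixels contribute to neither the distance loss nor the pseudo-label CE loss.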
We also propose a category-wise distance loss L_{dis}^s on the source domain samples and two domain adaptation losses on the active target samples, i.e., a CE loss L_{CE}^t and a category-wise distance loss L_{dis}^t based on the pseudo-labels, to guide the adaptation process. These are defined as:

L_{dis}^s = \sum_{i=1}^{N} \sum_{j=1}^{H \times W} \sum_{c=1}^{C} y_{ijc}^s \| f_c^s - f_D(Enc(x_i^s))|_j \|_2,  (8)

L_{CE}^t = -\sum_{i=1}^{M} \sum_{j=1}^{H \times W} \sum_{c=1}^{C} a_{ij}^t \hat{y}_{ijc}^t \log(p_{ijc}^t),  (9)

L_{dis}^t = \sum_{i=1}^{M} \sum_{j=1}^{H \times W} \sum_{c=1}^{C} a_{ij}^t \hat{y}_{ijc}^t \| f_c^s - f_D(Enc(x_i^t))|_j \|_2.  (10)

Although only the active samples are directly driven towards the category anchors by L_{dis}^t, other inactive target samples within each category may also follow the active samples due to being clustered. Therefore, minimizing L_{dis}^t indeed reduces the intra-category variances in the target domain. Meanwhile, L_{CE}^t leverages the pseudo-labels to update the network weights together with the source domain CE loss, prompting the encoder, decoder, and classifier to adapt to the target domain and therefore reducing the intra- and inter-category variances simultaneously. The illustration is shown in Figure 1(c). To leverage the complementarity between the proposed category anchor-based PLA and the category probability-based PLA in [47], we also identify active target samples based on the predicted category probability and add an extra CE loss L_{CE}^{tP} similar to Eq. 
(9).

L_{CE}^{tP} = -\sum_{i=1}^{M} \sum_{j=1}^{H \times W} \sum_{c=1}^{C} a_{ij}^{tP} \hat{y}_{ijc}^{tP} \log(p_{ijc}^t),  (11)

where a_{ij}^{tP} and \hat{y}_{ijc}^{tP} refer to the probability-based active state and assigned pseudo-labels, respectively. Then the final objective function is as follows:

L = L_{CE}^s + \lambda_1 (L_{dis}^s + L_{dis}^t) + \lambda_2 (L_{CE}^t + L_{CE}^{tP}),  (12)

where \lambda_1 and \lambda_2 are loss weights.

3.4 Stagewise Training Procedure

We tried to train the CAG-UDA model in a single stage and update the pseudo-labels at each iteration. However, this is not stable: error-prone pseudo-labels may produce incorrect supervision signals, lead to more erroneous pseudo-labels iteratively, and eventually trap the network in a local minimum with poor performance, e.g., less than 30 mIoU. To address this issue, we propose a stagewise training mechanism, as summarized in Algorithm 1. First, we pretrain the segmentation model on the source domain. Then, we leverage the global feature alignment method in [14] to warm up the training process and obtain a well-initialized model. Next, we train the CAG-UDA model with the proposed losses for several stages. At the beginning of each stage, we calculate the CAs, identify the active target samples, and assign pseudo-labels to them. By using this stagewise delayed updating mechanism, we avoid updating the pseudo-labels at each iteration and reduce the error accumulation. 
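The objective of Eq. (12) is just a weighted sum of the five component losses, with the distance terms of Eqs. (8) and (10) sharing one form. A NumPy sketch under assumed flattened shapes; the default weights λ1 = 0.3 and λ2 = 0.7 follow the implementation details in Section 4.1, and the function names are illustrative.

```python
import numpy as np

def anchor_distance_loss(features, labels, anchors, active=None):
    """Anchor-based distance loss (Eqs. 8 and 10): summed L2 distance
    between each labeled pixel feature and its category anchor. `active`
    is the 0/1 mask a_ij used on the target side (None on the source side)."""
    d = np.linalg.norm(features - anchors[labels], axis=1)  # ||f_c - f_D(.)|_j||
    if active is not None:
        d = d * active
    return float(d.sum())

def total_loss(ce_s, dis_s, dis_t, ce_t, ce_tp, lam1=0.3, lam2=0.7):
    """Overall objective of Eq. (12)."""
    return ce_s + lam1 * (dis_s + dis_t) + lam2 * (ce_t + ce_tp)

anchors = np.array([[0.0, 0.0], [2.0, 0.0]])
feats = np.array([[1.0, 0.0], [2.0, 1.0]])
dis_s = anchor_distance_loss(feats, np.array([0, 1]), anchors)  # 1 + 1 = 2
obj = total_loss(ce_s=1.0, dis_s=dis_s, dis_t=3.0, ce_t=4.0, ce_tp=5.0)
```

Keeping the anchors and the active masks fixed within a stage means only `features` changes during SGD, which is what lets the distance terms act as stable regularizers.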
Hence, L_{dis}^t and L_{CE}^t serve as two regularizations on the network.

Algorithm 1 Stagewise training of the CAG-UDA model
Input: training dataset (X_s, Y_s, X_t); maximum stages K; maximum iterations L; distance threshold \Delta d.
Output: M_K and (\hat{Y}_s, \hat{Y}_t).
1: Pretraining: M_0^p \leftarrow (X_s, Y_s) according to [4];
2: Warm-up: M_0 \leftarrow (X_s, Y_s) and M_0^p according to [14];
3: for k \leftarrow 1 to K do
4:     CAC: \{f_c^s\} \leftarrow M_{k-1} and (X_s, Y_s) according to Eq. (4);
5:     ATI: \{d_{ijc}^t\}, \{a_{ij}^t\} \leftarrow M_{k-1}, (X_s, Y_s, X_t), \{f_c^s\} and \Delta d according to Eq. (5) and Eq. (6);
6:     PLA: \{\hat{y}_{ijc^*}^t\} \leftarrow \{d_{ijc}^t\}, \Delta d according to Eq. (7);
7:     for n \leftarrow 1 to L do
8:         SGD: training M_{k-1} on (X_s, Y_s, X_t, \{\hat{y}_{ijc^*}^t\}, \{f_c^s\}, \{a_{ij}^t\}) according to Eq. (12);
9:     end for
10:    M_k \leftarrow M_{k-1}
11: end for
12: Prediction: (\hat{Y}_s, \hat{Y}_t) \leftarrow (X_s, X_t) and M_K.

4 Experiments

4.1 Experimental Settings

Datasets and evaluation metrics: Following [21], we evaluate the CAG-UDA model in two common scenarios, GTA5 [31]→Cityscapes [8] and SYNTHIA [32]→Cityscapes [8]. 
GTA5 contains 24,966 1914×1052-pixel images and has the same 19 category annotations as Cityscapes. SYNTHIA contains 9,400 1280×760-pixel images and only has 16 common category annotations. Cityscapes is divided into a training set, a validation set, and a testing set. The training set consists of 2,975 2048×1024-pixel images and the validation set contains 500 images at the same resolution. Following common practice, we report the results on the Cityscapes validation set, specifically, the category-wise intersection over union (IoU). Moreover, we also report the mean IoU (mIoU) of all 19 categories in the GTA5→Cityscapes scenario and the 16 common categories in the SYNTHIA→Cityscapes scenario. Some methods [36, 26, 21] only reported the mIoU for 13 common categories in the SYNTHIA→Cityscapes scenario, denoted as mIoU* in this paper.

Table 1: Results of the CAG-UDA model and SOTA methods (GTA5→Cityscapes).
| Method | road | sidewalk | building | wall | fence | pole | light | sign | veg. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU |
| Source only | 75.8 | 16.8 | 77.2 | 12.5 | 21.0 | 25.5 | 30.1 | 20.1 | 81.3 | 24.6 | 70.3 | 53.8 | 26.4 | 49.9 | 17.2 | 25.9 | 6.5 | 25.3 | 36.0 | 36.6 |
| AdaptSegNet [36] | 86.5 | 25.9 | 79.8 | 22.1 | 20.0 | 23.6 | 33.1 | 21.8 | 81.8 | 25.9 | 75.9 | 57.3 | 26.2 | 76.3 | 29.8 | 32.1 | 7.2 | 29.5 | 32.5 | 41.4 |
| Source only | 69.9 | 22.3 | 75.6 | 15.8 | 20.1 | 18.8 | 28.2 | 17.1 | 75.6 | 8.0 | 73.5 | 55.0 | 2.9 | 66.9 | 34.4 | 30.8 | 0.0 | 18.4 | 0.0 | 33.3 |
| DCAN [40] | 85.0 | 30.8 | 81.3 | 25.8 | 21.2 | 22.2 | 25.4 | 26.6 | 83.4 | 36.7 | 76.2 | 58.9 | 24.9 | 80.7 | 29.5 | 42.9 | 2.5 | 26.9 | 11.6 | 41.7 |
| Source only | 75.8 | 16.8 | 77.2 | 12.5 | 21.0 | 25.5 | 30.1 | 20.1 | 81.3 | 24.6 | 70.3 | 53.8 | 26.4 | 49.9 | 17.2 | 25.9 | 6.5 | 25.3 | 36.0 | 36.6 |
| CLAN [26] | 87.0 | 27.1 | 79.6 | 27.3 | 23.3 | 28.3 | 35.5 | 24.2 | 83.6 | 27.4 | 74.2 | 58.6 | 28.0 | 76.2 | 33.1 | 36.7 | 6.7 | 31.9 | 31.4 | 43.2 |
| AdvEnt [39] | 89.4 | 33.1 | 81.0 | 26.6 | 26.8 | 27.2 | 33.5 | 24.7 | 83.9 | 36.7 | 78.8 | 58.7 | 30.5 | 84.8 | 38.5 | 44.5 | 1.7 | 31.6 | 32.4 | 45.5 |
| DISE [2] | 91.5 | 47.5 | 82.5 | 31.3 | 25.6 | 33.0 | 33.7 | 25.8 | 82.7 | 28.8 | 82.7 | 62.4 | 30.8 | 85.2 | 27.7 | 34.5 | 6.4 | 25.2 | 24.4 | 45.4 |
| Cycada [13, 21] | 86.7 | 35.6 | 80.1 | 19.8 | 17.5 | 38.0 | 39.9 | 41.5 | 82.7 | 27.9 | 73.6 | 64.9 | 19.0 | 65.0 | 12.0 | 28.6 | 4.5 | 31.1 | 42.0 | 42.7 |
| Source only | 69.0 | 12.7 | 69.5 | 9.9 | 19.5 | 22.8 | 31.7 | 15.3 | 73.9 | 11.3 | 67.2 | 54.7 | 23.9 | 53.4 | 29.7 | 4.6 | 11.6 | 26.1 | 32.5 | 33.6 |
| BLF [21] | 91.0 | 44.7 | 84.2 | 34.6 | 27.6 | 30.2 | 36.0 | 36.0 | 85.0 | 43.6 | 83.0 | 58.6 | 31.6 | 83.3 | 35.3 | 49.7 | 3.3 | 28.8 | 35.6 | 48.5 |
| Source only | 69.8 | 25.4 | 74.7 | 11.3 | 18.3 | 24.2 | 35.6 | 23.3 | 72.0 | 14.4 | 65.3 | 58.7 | 29.0 | 53.1 | 14.3 | 19.2 | 7.9 | 15.1 | 16.3 | 34.1 |
| CAG-UDA | 90.4 | 51.6 | 83.8 | 34.2 | 27.8 | 38.4 | 25.3 | 48.4 | 85.4 | 38.2 | 78.1 | 58.6 | 34.6 | 84.7 | 21.9 | 42.7 | 41.1 | 29.3 | 37.2 | 50.2 |

Table 2: Results of the CAG-UDA model on the testing set (GTA5→Cityscapes).
| Method | road | sidewalk | building | wall | fence | pole | light | sign | veg. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU |
| CAG-UDA | 93.2 | 57.0 | 85.6 | 35.7 | 25.1 | 37.5 | 30.8 | 45.3 | 87.1 | 50.1 | 89.4 | 62.7 | 40.8 | 87.8 | 18.0 | 32.4 | 34.5 | 34.4 | 35.4 | 51.7 |

Implementation details: In our experiments, training images were randomly cropped to 1280×640 pixels after being randomly resized by ×1 ∼ ×1.5. Due to GPU memory limitations, the batch size was set to 1 and the weights of all batch normalization layers were frozen. 
In the warm-up phase, we used a CNN-based domain discriminator comprising 5 convolutional layers with kernel size 3×3, filter numbers [64, 128, 256, 512, 1], and stride 2. Each of the first three convolutional layers is followed by a ReLU layer, while the fourth is followed by a leaky ReLU layer with a negative slope of 0.2. We used a CE loss and an adversarial loss to train the model for 20 epochs, with the adversarial loss weight set to 1e-2. In the stagewise training phase, we trained the CAG-UDA model for 20 epochs with the SGD optimizer. The initial learning rate was 2.5e-4, which decayed under the poly policy with power 0.9. The weight decay, momentum, λ1, and λ2 were set to 1e-4, 0.9, 0.3, and 0.7, respectively. Δd was set to 2.5. We also assigned pseudo-labels based on predicted category probabilities, where the threshold P0 was set to 0.95. Experiments were conducted on a Tesla V100 GPU with a PyTorch implementation. Code will be made publicly available.

4.2 Main Results

Quantitative results: The results of the GTA5→Cityscapes scenario are presented in Table 1. All the models adopt ResNet-101 as the backbone network for a fair comparison. Overall, our CAG-UDA model clearly outperforms all other models with a 50.2 mIoU, surpassing the model trained only on the source domain by a significant gain of 16.1. Compared with CLAN [26] and DISE [2], which implicitly align category-level features, our model achieves an extra gain of 4.5 and outperforms them on fence, traffic sign, rider, train, and bike by large margins. This is due to the proposed category anchor-guided alignment method, which explicitly uses category centroids as representatives of the feature distributions, reducing the side effect of category imbalance. Like [40, 13], BLF [21] also involves a style-transfer module but combines it with self-training in a bidirectional learning framework.
It achieved the second-best mIoU of 48.5. BLF achieves better results than the CAG-UDA model on stuff categories such as road, building, wall, terrain, and sky, but is inferior to the CAG-UDA model on small objects. This is because BLF includes a style-transfer module that benefits from the texture clues in the stuff categories and assigns reliable pseudo-labels accordingly. In contrast, CAG-UDA uses a category anchor-guided method that tackles the category imbalance and generates more informative pseudo-labels, leading to better results on more categories.

Table 3: Results of the CAG-UDA model and SOTA methods (SYNTHIA→Cityscapes).

Method           | road side. buil. wall fenc. pole light sign veg. sky  pers. rider car  bus  motor bike | mIoU mIoU*
AdaptSegNet [36] | 79.2 37.2  78.8   -    -     -    9.9  10.5 78.2 80.5 53.5  19.6  67.0 29.5 21.6  31.3 |  -   45.9
CLAN [26]        | 81.3 37.0  80.1   -    -     -   16.1  13.7 78.2 81.5 53.4  21.2  73.0 32.9 22.6  30.7 |  -   47.8
BLF [21]         | 86.0 46.7  80.3   -    -     -   14.1  11.6 79.2 81.3 54.1  27.9  73.7 42.2 25.7  45.3 |  -   51.4
CAG-UDA (13)     | 84.8 41.7  85.5   -    -     -   13.7  23.0 86.5 78.1 66.3  28.1  81.8 21.8 22.9  49.0 |  -   52.6
DCAN [40]        | 82.8 36.4  75.7  5.1  0.1  25.8   8.0  18.7 74.7 76.9 51.1  15.9  77.7 24.8  4.1  37.3 | 38.4  -
DISE [2]         | 91.7 53.5  77.1  2.5  0.2  27.1   6.2   7.6 78.4 81.2 55.8  19.2  82.3 30.3 17.1  34.3 | 41.5  -
AdvEnt [39]      | 85.6 42.2  79.7  8.7  0.4  25.9   5.4   8.1 80.4 84.1 57.9  23.8  73.3 36.4 14.2  33.0 | 41.2  -
CAG-UDA (16)     | 84.7 40.8  81.7  7.8  0.0  35.1  13.3  22.7 84.5 77.6 64.2  27.8  80.9 19.7 22.7  48.3 | 44.5  -

Figure 2: (a) Subjective evaluation of the CAG-UDA model on some images from the Cityscapes validation set. (b) Comparison between probability-based PLA and the proposed CAs-based PLA on an image from the Cityscapes training set. Best viewed in color and zoomed in.

We also present the results on the testing set of the Cityscapes dataset in Table 2. The CAG-UDA model reaches 51.7 mIoU, demonstrating the good generalization of our method.

Results in the SYNTHIA→Cityscapes scenario are listed in Table 3. Following previous work, we report the performance of the CAG-UDA model under two mIoU metrics for fair comparison: over 13 categories (mIoU*) and over 16 categories (mIoU). Since the domain shift is much larger than in the GTA5→Cityscapes scenario, the performance is slightly worse. The CAG-UDA model still achieves better results than all previous SOTA methods, including CLAN and BLF. Similar to the GTA5 scenario, the superiority of the CAG-UDA model remains most evident on small objects such as pole, sign, person, and bike.

Qualitative results: Some qualitative segmentation examples are given in Figure 2(a). Training merely on the source domain leads to limited generalization ability; e.g., the road and a person are incorrectly predicted as sidewalk and building in the first row. Benefiting from the category anchor-guided adaptation, the proposed CAG-UDA model achieves better results, especially for small objects, e.g., pole, sign, and person. We also attribute this to the proposed CAs-based pseudo-label assignment, which successfully activates small objects and assigns them reliable pseudo-labels, as highlighted by the red circles in Figure 2(b).
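To make the distinction in Figure 2(b) concrete, the two pseudo-label assignment (PLA) variants can be sketched as follows. This is a simplified, hypothetical implementation: it uses plain Euclidean nearest-anchor distances with the thresholds Δd = 2.5 and P0 = 0.95 from the implementation details, whereas the actual model operates on the segmentation network's feature maps with the active-sample criterion defined in Section 3.3:

```python
import numpy as np

def category_anchors(src_feats, src_labels, num_classes):
    """Category-wise centroids (anchors, CAs) of source-domain features.

    src_feats: (N, D) pixel features; src_labels: (N,) ground-truth categories.
    """
    anchors = np.zeros((num_classes, src_feats.shape[1]))
    for c in range(num_classes):
        mask = src_labels == c
        if mask.any():
            anchors[c] = src_feats[mask].mean(axis=0)
    return anchors

def anchor_based_pla(tgt_feats, anchors, delta_d=2.5):
    """CAs-based PLA (sketch): a target pixel becomes an active sample and
    receives the label of its nearest anchor when the distance to that
    anchor is below delta_d; all other pixels stay unlabeled (-1)."""
    # (N, C) Euclidean distances from every target feature to every anchor
    dists = np.linalg.norm(tgt_feats[:, None, :] - anchors[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    active = dists.min(axis=1) < delta_d
    return np.where(active, nearest, -1)

def probability_based_pla(tgt_probs, p0=0.95):
    """Probability-based PLA baseline: pseudo-label only pixels whose top
    predicted category probability exceeds the threshold p0."""
    conf = tgt_probs.max(axis=1)
    return np.where(conf > p0, tgt_probs.argmax(axis=1), -1)
```

Because the anchors are per-category statistics, rare small-object categories keep their own anchor instead of being drowned out by dominant stuff categories, which is the intuition behind the better activation of small objects discussed above.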
More results can be found in the supplement.

Table 4: Results of the ablation study (GTA5→Cityscapes).

Method                    | road side. buil. wall fenc. pole light sign veg. terr. sky  pers. rider car  truck bus  train motor bike | mIoU gain
Source only               | 69.8 25.4  74.7  11.3 18.3  24.2 35.6  23.3 72.0 14.4  65.3 58.7  29.0  53.1 14.3  19.2  7.9  15.1  16.3 | 34.1  -
Warm-up                   | 88.4 45.2  82.0  30.1 22.0  35.4 36.7  23.7 82.7 27.6  70.8 51.4  26.9  81.5 14.5  25.0 21.4  13.0   7.9 | 41.4  7.3
+L^tP_CE                  | 88.8 45.5  83.7  33.2 21.4  39.5 40.0  25.9 83.9 33.8  74.3 58.2  24.9  84.8 19.3  32.8 22.6  15.0  14.7 | 44.3 10.2
+L^t_CE                   | 88.3 46.9  81.5  28.7 27.7  38.9 27.0  40.4 83.7 31.2  74.9 61.8  30.2  84.0 15.9  36.7 23.4  23.3  31.7 | 46.1 12.0
+L^tP_dis                 | 89.4 40.1  81.8  31.0 22.6  39.9 41.2  23.2 83.0 28.3  68.5 54.5  23.8  85.7 21.5  25.6  0.7  13.9   8.5 | 41.2  7.1
+L^t_dis                  | 88.9 41.7  82.0  31.7 22.5  39.7 41.2  23.5 82.7 27.0  70.0 57.8  25.7  85.8 21.9  27.7  1.1  18.0  11.1 | 42.1  8.0
+L^t_CE + L^t_dis         | 88.1 46.6  82.1  30.2 28.4  39.7 31.3  38.8 83.6 30.7  75.1 61.9  28.5  84.3 16.3  36.3 29.1  25.0  29.4 | 46.6 12.5
+L^t_CE + L^tP_CE         | 88.9 47.1  83.0  31.0 27.3  39.7 31.0  36.0 84.3 32.6  75.1 62.0  29.4  84.6 16.6  35.7 27.2  19.2  28.4 | 46.3 12.2
CAG-UDA (Stage 1)         | 88.8 47.5  83.6  31.7 29.1  39.7 34.4  35.6 84.4 33.0  76.8 62.1  28.2  84.5 17.2  35.2 32.0  25.8  27.6 | 47.2 13.1
CAG-UDA (Stage 2)         | 90.4 50.6  84.0  33.5 28.3  39.9 31.6  42.4 85.1 35.2  77.3 61.5  34.2  84.9 19.4  41.7 41.0  27.3  32.0 | 49.5 15.4
CAG-UDA (Stage 3)         | 90.4 51.6  83.8  34.2 27.8  38.4 25.3  48.4 85.4 38.2  78.1 58.6  34.6  84.7 21.9  42.7 41.1  29.3  37.2 | 50.2 16.1

Ablation studies: The ablation study results are listed in Table 4. We add a superscript P to the loss symbols to denote that the active target samples are identified by category probabilities, as described in Section 3.3. Several models were trained by combining L^t_CE with different losses. Comparing the +L^tP_CE and +L^t_CE rows shows that the proposed category anchor-guided PLA is more effective than the predicted category probability-based one. More detailed comparisons of different hyper-parameters can be found in the supplement. In addition, the CE loss is more effective than the distance loss. The results of the combined losses demonstrate the complementarity between the CE loss and the distance loss, as well as between the category anchor-based and probability-based PLA. We therefore combine them as in Eq. (12) to train the CAG-UDA model and obtain better results, as listed in the bottom rows. Finally, the stagewise-trained CAG-UDA model obtains an mIoU of 50.2, outperforming the SOTA models. We also trained the CAG-UDA model for an extra stage (Stage 4); however, performance saturated at 50.2 mIoU with no further improvement.

4.3 Limitations

The proposed CAG-UDA model relies on reliable pseudo-labels to guarantee correct supervision of the network being trained. To this end, we adopt a warm-up strategy to roughly align the two domains and increase the reliability of the pseudo-labels generated from the CAs, as described in Section 3.4. For comparison, we also conducted an experiment that removed the warm-up stage and observed a significant drop of 6.3 mIoU. Other techniques could also be used to obtain reliable pseudo-labels, such as enforcing local smoothness on the probability map, using a normalized threshold when assigning pseudo-labels, and reducing the appearance bias through a style-transfer module.
We leave building a stage-free, end-to-end CAG-UDA model to future work.

5 Conclusion

In this paper, we proposed a novel category anchor-guided (CAG) unsupervised domain adaptation (UDA) model for semantic segmentation. The CAG-UDA model successfully adapts the segmentation model to the target domain through category-wise feature alignment guided by category anchors. Specifically, we proposed a category anchor construction module, an active target sample identification module, and a pseudo-label assignment module. We utilized a distance loss and a CE loss based on the identified active target samples, which complementarily enhance the adaptation performance. We also proposed a stagewise training mechanism to reduce error accumulation and adapt the CAG-UDA model progressively. Experiments on the GTA5 and SYNTHIA datasets demonstrate the superiority of the CAG-UDA model over representative methods in generalizing to the Cityscapes dataset.

Acknowledgements

This work is supported by the Australian Research Council Project FL-170100117 and the National Natural Science Foundation of China Project 61806062.

References

[1] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3722–3731, 2017.

[2] W. Chang, H. Wang, W. Peng, and W. Chiu. All about structure: Adapting structural information across domains for boosting semantic segmentation. CoRR, abs/1903.12212, 2019.

[3] C. Chen, W. Xie, T. Xu, W. Huang, Y. Rong, X. Ding, Y. Huang, and J. Huang. Progressive feature alignment for unsupervised domain adaptation. arXiv preprint arXiv:1811.08585, 2018.

[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille.
DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.

[5] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.

[6] M. Chen, K. Q. Weinberger, and J. Blitzer. Co-training for domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS), pages 2456–2464, 2011.

[7] Y.-H. Chen, W.-Y. Chen, Y.-T. Chen, B.-C. Tsai, Y.-C. Frank Wang, and M. Sun. No more discrimination: Cross city adaptation of road scene segmenters. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1992–2001, 2017.

[8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.

[10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[13] J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.

[14] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.

[15] W. Hong, Z. Wang, M. Yang, and J. Yuan. Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1335–1344, 2018.

[16] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5001–5009, 2018.

[17] W. Jiang, H. Gao, W. Lu, W. Liu, F.-L. Chung, and H. Huang. Stacked robust adaptively regularized auto-regressions for domain adaptation. IEEE Transactions on Knowledge and Data Engineering, 31(3):561–574, 2018.

[18] W. Jiang, W. Liu, and F.-L. Chung. Knowledge transfer for spectral clustering. Pattern Recognition, 81:484–496, 2018.

[19] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. arXiv preprint arXiv:1901.00976, 2019.

[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.

[21] Y. Li, L. Yuan, and N. Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. arXiv preprint arXiv:1904.10620, 2019.

[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014.

[23] M.-Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 469–477, 2016.

[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.

[25] M. Long, Y. Cao, J. Wang, and M. Jordan. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning (ICML), pages 97–105, 2015.

[26] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. arXiv preprint arXiv:1809.09478, 2018.

[27] Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4500–4509, 2018.

[28] P. O. Pinheiro. Unsupervised domain adaptation with similarity learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8004–8013, 2018.

[29] G.-J. Qi, W. Liu, C. Aggarwal, and T. Huang. Joint intermodal and intramodal label transfers for extremely rare or unseen classes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1360–1373, 2016.

[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015.

[31] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision (ECCV), pages 102–118. Springer, 2016.

[32] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3234–3243, 2016.

[33] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3723–3732, 2018.

[34] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3752–3761, 2018.

[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

[36] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7472–7481, 2018.

[37] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7167–7176, 2017.

[38] D. Vazquez, A. M. Lopez, J. Marin, D. Ponsa, and D. Geronimo. Virtual and real world adaptation for pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):797–809, 2014.

[39] T.-H. Vu, H. Jain, M. Bucher, M. Cord, and P. Pérez. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. arXiv preprint arXiv:1811.12833, 2018.

[40] Z. Wu, X. Han, Y.-L. Lin, M. Gokhan Uzunbas, T. Goldstein, S. Nam Lim, and L. S. Davis. DCAN: Dual channel-wise alignment networks for unsupervised scene adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 518–534, 2018.

[41] S. Xie, Z. Zheng, L. Chen, and C. Chen. Learning semantic representations for unsupervised domain adaptation. In International Conference on Machine Learning (ICML), pages 5419–5428, 2018.

[42] J. Zhang, Y. Cao, S. Fang, Y. Kang, and C. Wen Chen. Fast haze removal for nighttime image using maximum reflectance prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7418–7426, 2017.

[43] J. Zhang, Y. Cao, Y. Wang, C. Wen, and C. W. Chen. Fully point-wise convolutional neural network for modeling statistical regularities in natural images. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 984–992. ACM, 2018.

[44] J. Zhang and D. Tao. FAMED-Net: A fast and accurate multi-scale end-to-end dehazing network. IEEE Transactions on Image Processing, 29:72–84, 2020.

[45] Y. Zhang, P. David, and B. Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2020–2030, 2017.

[46] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.

[47] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), pages 289–305, 2018.