Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
The paper presents an unsupervised domain adaptation (UDA) approach for semantic segmentation that uses category centroids (anchors) for category-wise feature alignment, keeping features from the same category close together while pushing features from different categories apart. Pseudo-labeling is applied to target samples that are sufficiently close to a source anchor (active samples).

Pros:
+ To my knowledge, this is the first UDA approach for semantic segmentation that combines pseudo-labeling with feature alignment. Similar ideas have been applied to classification [3,17,37], but the differences are sufficient and well acknowledged by the authors.
+ The presented model is sound and well described. Fig. 1 is helpful for understanding both the intuition behind the idea (b-c) and the actual architecture (a).
+ The stage-wise training procedure is a sensible way to guarantee good initial anchors, although it makes training more cumbersome. Also, performance does not appear to have saturated at stage 3; would the results improve with more stages?
+ Convincing results, especially for small classes, where CA-based PLA clearly seems crucial for providing good pseudo-labels. Moreover, the authors offer a possible explanation for the limitation of their method relative to style transfer on stuff classes. As a suggestion for future work, one could consider combining CAG with a style-transfer module to address stuff classes as well.

Cons:
- Since pseudo-labeling is done at the pixel level, the pseudo-labels are not necessarily smooth. The shown pseudo-labels seem relatively smooth (Fig. 2 and suppl.), but the method could benefit from enforcing local smoothness to increase the robustness of ATI or PLA.
- It would be clearer to add the explicit definition of the loss in L208.
- Missing related work (albeit for classification): "Unsupervised Domain Adaptation with Similarity Learning", Pinheiro, CVPR 2018.

Minor things:
- Citation missing in L98 for Li et al.
- In Eq. 4, if x is defined as the image as in L128, the input of f_D should be something like Enc(x) (i.e., the encoded features) for coherence with Fig. 1a. I understand this is a notational abuse for clarity, but it should be mentioned somewhere.
- A few typos: SYNTHIA in Tab. 2's caption, "outperforms" in L253, etc.

The authors addressed most of the reviewers' concerns in the rebuttal, and I therefore keep my acceptance score.
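The anchor mechanism the review summarizes (per-category centroids of source features) can be sketched as follows. This is a minimal illustration under the assumption that encoder features and labels have been flattened to one row per pixel; the function name and shapes are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def category_anchors(features, labels, num_classes):
    """Compute one anchor (mean feature vector) per category.

    features: (N, D) array of per-pixel encoder features (source domain)
    labels:   (N,) array of integer category labels in [0, num_classes)
    Returns a (num_classes, D) array of centroids; rows for categories
    absent from the batch are left as zeros.
    """
    anchors = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            anchors[c] = features[mask].mean(axis=0)
    return anchors
```

In practice such centroids would be accumulated over the whole source set (or updated with a running mean), not a single batch.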
The UDA framework proposed by the authors introduces some interesting novel concepts, and overall the paper is clearly written. I feel the paper could be improved by demonstrating the effect of using active samples, as these seem to be a prominent element in the optimisation framework. The ablation results in Table 3 also indicate that L_CE^tP and L_CE^t are both significant contributors to the final performance. I would like to see more elaboration on the setting of Delta_d (used in both Eq. 5 and Eq. 7), which directly determines the number of active samples in the source and target domains: how does the resulting pseudo-label alignment contribute to, or impact, the final performance? What is the effect of changing the weights in Eq. 11?

Regarding the setting of Delta_d: it is a threshold on distance differences. Is this ideal? Depending on the dataset, a distance-based threshold can vary considerably and be hard to configure. Would it be more reasonable to apply the threshold in a normalised setting? Also, the same threshold is used for both source and target anchors, which may not be optimal.

Some minor corrections: the context around Eq. 5 could be written as: "Mathematically, this can be formulated as follows. We first define the distance between ... and the c^th category anchor as d_ijc^t = ... (5). Then, we sort ... in ascending order and compare the shortest d_ijc with the second shortest d_ijc ... we identify this target sample as [an] active one, ..." It would also be better to give the formula for L_CE^tP, given its importance. The comments right under Eq. 7 seem a bit loose: you state that "they turn out to be more reliable than ..." without providing any evidence or justification. What do you mean by "they do not depend on the decision boundaries"?
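The selection rule the review paraphrases (sort anchor distances in ascending order, compare the shortest with the second shortest, and mark the pixel active when the gap exceeds Delta_d) can be sketched as below. This is a hedged reading of the review's description, not the paper's code; the function name and the choice of Euclidean distance are assumptions.

```python
import numpy as np

def select_active(features, anchors, delta_d):
    """Pseudo-label target pixels whose nearest anchor is a clear winner.

    features: (N, D) target features; anchors: (C, D) category anchors.
    A pixel is 'active' when the gap between its second-smallest and
    smallest anchor distances exceeds delta_d (the review's Delta_d).
    Returns (pseudo_labels, active_mask).
    """
    # Euclidean distance from every feature to every anchor: shape (N, C)
    dists = np.linalg.norm(features[:, None, :] - anchors[None, :, :], axis=2)
    sorted_d = np.sort(dists, axis=1)
    active = (sorted_d[:, 1] - sorted_d[:, 0]) > delta_d
    pseudo = dists.argmin(axis=1)
    return pseudo, active
```

This sketch also makes the reviewer's concern concrete: the margin `sorted_d[:, 1] - sorted_d[:, 0]` lives on the raw feature scale, so a fixed delta_d admits very different fractions of pixels across datasets, whereas a normalised margin (e.g. dividing by the smallest distance) would be scale-invariant.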
In this paper, a category-anchor-based domain adaptation approach is proposed for semantic segmentation. The centroid of each source-domain category is used as an anchor for discovering confident target samples, where the difference of distances to the anchors serves as the metric. Identified target samples are labeled according to their closest anchors, and the pseudo-labeled target samples are then used to train the segmentation model. Distances to anchors are also used as a regularization to guide the training. Experiments on benchmark datasets validate the effectiveness of the proposed method. The paper is well written and easy to follow. One problem with the proposed method is that it involves many hyperparameters, for example, $\Delta_d$, $\lambda_1$, and $\lambda_2$. Although an ablation study is provided in the supplementary, it is still quite limited. It would be more convincing if validation over a wider range of hyperparameter values were provided. How important is the warm-up stage? What happens if you remove it?
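The summary's last mechanism, using distances to anchors as a regularizer, could take roughly the following form: pull each feature toward the anchor of its (pseudo-)label. This is only a plausible sketch of a distance-based term, not the paper's exact loss; the function name and the mean-distance formulation are assumptions.

```python
import numpy as np

def anchor_distance_loss(features, labels, anchors):
    """Mean distance from each feature to the anchor of its (pseudo-)label.

    Minimizing this term pulls features toward their category anchors,
    which is one way distances to anchors could regularize training.
    features: (N, D); labels: (N,) int category ids; anchors: (C, D).
    """
    diffs = features - anchors[labels]       # per-pixel offset to own anchor
    return np.linalg.norm(diffs, axis=1).mean()
```

In a full pipeline this term would be weighted (e.g. by the $\lambda$ hyperparameters the review mentions) and combined with the cross-entropy losses, which is precisely why the reviewer asks for a broader hyperparameter validation.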