Review for NeurIPS paper: Heuristic Domain Adaptation

NeurIPS 2020


Review 1

Summary and Contributions: The paper presents a method inspired by heuristic search algorithms (such as A*) for the problem of Domain Adaptation (DA). The authors claim that to achieve domain-invariant representations, one must explicitly model domain-specific characteristics. The ideal representation is treated as the goal, and the intermediate domain-specific representations are regarded as the distance from the current representation to the ideal one. When the heuristic representations are near zero, the terminal state is reached. To achieve this goal, the authors propose a domain adaptation network made of a fundament network F and a heuristic network H (which may be divided into several sub-networks). Transferable representations are given by G(x) = F(x) - H(x). To construct and reduce the domain-specific characteristics, the authors propose three constraints on the representations: (1) Similarity - limit the initial value of the heuristic representations based on the cosine similarity between G(x) and H(x); (2) Independence - make G(x) and H(x) independent components in the representation; (3) Termination - the range of the heuristic representations H(x) should be near zero. Item (1) is achieved by initializing the fundament network parameters close to zero, and items (2) and (3) are achieved by applying an L1-norm on H(x), termed the heuristic loss. The overall loss includes the heuristic loss, a classification loss, and a transfer loss. An upper bound for the error on the target set is presented in the paper, and the method is demonstrated on the challenging DomainNet and Office-Home datasets. The main contributions of the paper are (1) modeling domain-invariant and domain-specific information, inspired by heuristic search processes, (2) HADA, a framework for domain adaptation, and (3) state-of-the-art performance on three domain adaptation tasks: unsupervised DA, multi-source DA, and semi-supervised DA.
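To make the summarized construction concrete, here is a minimal sketch of G(x) = F(x) - H(x), with H(x) taken as the sum of the sub-network outputs and the heuristic loss taken as an L1-norm on H(x). It operates on plain feature vectors rather than real networks. All function names, the use of a sum (rather than a mean) for the L1 term, and the `heuristic_weight` parameter are assumptions made for illustration, not the authors' implementation.

```python
import math


def heuristic_representation(sub_outputs):
    """H(x): element-wise sum of the heuristic sub-network outputs."""
    return [sum(values) for values in zip(*sub_outputs)]


def transferable_representation(f_x, h_x):
    """G(x) = F(x) - H(x), computed element-wise."""
    return [f - h for f, h in zip(f_x, h_x)]


def heuristic_loss(h_x):
    """L1-norm of H(x); pushing it down drives H(x) toward zero (termination)."""
    return sum(abs(h) for h in h_x)


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def total_loss(cls_loss, transfer_loss, h_x, heuristic_weight=1.0):
    """Classification + transfer + weighted heuristic loss (weighting is assumed)."""
    return cls_loss + transfer_loss + heuristic_weight * heuristic_loss(h_x)


# Toy feature vectors standing in for network outputs (illustrative values only).
f_x = [1.0, 2.0, 3.0]
sub_outputs = [[0.1, 0.0, -0.2], [0.1, 0.2, 0.0]]
h_x = heuristic_representation(sub_outputs)
g_x = transferable_representation(f_x, h_x)
print(g_x, heuristic_loss(h_x), cosine_similarity(g_x, h_x))
```

In a real model the vectors would be batched network activations and the classification and transfer losses would come from the classifier and the domain discriminator.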

Strengths: The paper has several strengths: 1. The method presented in the paper tries to model domain-specific characteristics as well as domain-invariant representations. Although not novel, this is an interesting approach that is less common in the current field. 2. The proposed approach is general and can be applied to many DA setups, as was shown in the paper. Also, it seems rather simple to implement or to add to current methods. 3. The method was demonstrated on the challenging Office-Home and DomainNet datasets and achieved good results on both compared to baseline methods. 4. The theoretical insight appears sound and corresponds with similar derivations seen in this field.

Weaknesses: I would like to get clarifications about the following: 1. Based on the intuition from heuristic search, the authors chose to subtract the representation H(x) from F(x). This is one approach. Have you tried other approaches as well, for example, using another NN over concatenated representations with the gradient reversal layer applied only on the fundament network? 2. Line 118 states that each sub-network tends to model a local domain-specific property. Why? Can it be shown (even empirically)? Do they model different properties? And if that is indeed correct, why is a summation over the sub-networks' representations the right thing to do? 3. The authors published their code and deserve compliments for that. However, the paper omits several implementation details, some of which I was able to find in the code. For example, what networks were used for the fundament network and the heuristic network? How were the hyper-parameters chosen? Was early stopping applied? If so, on which part of the dataset (train, val, test) and on which domain? I would like to get answers to these questions and, in general, more details. From the code, it seems that early stopping was done based on the target-test set. If that is indeed the case, in my opinion, it is very problematic in an unsupervised setup.
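The early-stopping concern can be illustrated with a small sketch. It picks a checkpoint from a validation signal that needs no target labels (here, accuracy on a held-out source split) and compares the target accuracy at that checkpoint with the best target accuracy over all epochs, which is what selection on the target-test set would report. The function name and the per-epoch numbers are hypothetical and are not taken from the paper or its code.

```python
def select_checkpoint(val_scores):
    """Return the index of the epoch with the highest validation score (first on ties)."""
    return max(range(len(val_scores)), key=lambda i: val_scores[i])


# Hypothetical per-epoch accuracies, for illustration only.
source_val_acc = [0.60, 0.72, 0.75, 0.74]
target_test_acc = [0.40, 0.55, 0.53, 0.58]

chosen = select_checkpoint(source_val_acc)
reported = target_test_acc[chosen]      # number obtained without using target labels
oracle = max(target_test_acc)           # number obtained by selecting on target-test
print(chosen, reported, oracle)
```

The gap between `reported` and `oracle` is the optimistic bias that selecting on the target-test set can introduce in an unsupervised setup.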

Correctness: The method appears to generate good results on challenging datasets; however, as stated, some design choices are not clear to me. Also, I would like to better understand the experimental setup before I can confirm its correctness.

Clarity: The paper is well written and easy to follow. I found a few minor typos: - line 110 the word "the" appears twice. - Table 4, R->S column, 3-shot is not consistent with the rest of the columns. - In table 1 in the appendix, line 5 should be g(n) instead of f(n). - line 24 in the appendix should be h(m).

Relation to Prior Work: The authors addressed prior works and nicely described how their approach is different.

Reproducibility: No

Additional Feedback: The authors motivate their method by heuristic search algorithms. However, to me, this seems a bit artificial. I wasn't convinced of how strong this relationship truly is (even after reading the supplementary material). I think that the method is legitimate without this motivation. --------------------------------------------------------------------------------------------------------- The authors addressed my concerns adequately, and it appears that they addressed the other reviewers' comments as well. In my opinion, this is a decent paper with a rather unique solution that should be accepted; therefore, I decided to raise my score from 5 to 6.

Review 2

Summary and Contributions: The authors use a heuristic search approach to look for domain-invariant and domain-specific representations in order to address a variety of domain adaptation settings. Taking cues from A* search, the authors design a heuristic network as an ensemble of multiple sub-networks, each capturing domain-specific properties, to learn the domain-specific representation, while a domain-invariant network learns the invariant representation. The search for an optimal transfer function is built on constraints based on non-Gaussianity, measured with kurtosis, and on a termination heuristic.
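For reference, one common way to quantify non-Gaussianity is excess kurtosis, which is zero for a Gaussian distribution. The sketch below uses the biased (population) estimator on a plain list of values. The function name is hypothetical, and whether the paper uses this exact estimator is an assumption.

```python
def excess_kurtosis(values):
    """Biased excess kurtosis: m4 / m2**2 - 3, where m_k is the k-th central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for constant input")
    return m4 / (m2 * m2) - 3.0


# Toy activations (illustrative values only); a flat spread gives negative excess kurtosis.
print(excess_kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0]))
```

Values far from zero, in either direction, indicate a distribution that departs from Gaussian.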

Strengths: + The paper theoretically shows an upper bound on the error over the generative function to indicate that the search for an optimal function is possible with low error. + Experimental results demonstrate the effectiveness of the proposed approach on semi-supervised and unsupervised DA, and also indicate how the number of sub-networks chosen affects model performance.

Weaknesses: There are, however, a few unanswered questions. 1. Are the comparative results statistically significant? 2. In the ablation study, it is indicated that a larger number of sub-networks (M) would be difficult to optimize. Is this true in the multi-source DA setting as well? If so, how does the choice of M affect the performance when there are different numbers of domain-specific properties within each domain? 3. What are the limitations of the proposed solution?

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: Authors have adequately addressed my concerns in their rebuttal.

Review 3

Summary and Contributions: This paper explores explicitly separating domain-specific characteristics from existing representations by means of a unified heuristic function. The heuristic function is realized and enhanced by multiple sub-networks to ensure the accurate construction of domain-specific properties. This heuristic function is further integrated into adversarial domain adaptation, which forms the framework of heuristic adversarial domain adaptation (HADA). Experimental results show that HADA can achieve state-of-the-art results on unsupervised domain adaptation, multi-source domain adaptation, and semi-supervised domain adaptation.

Strengths: 1. It addresses the domain adaptation problem from a heuristic search perspective and provides a unified framework for domain-invariant feature learning. 2. Clear motivation and problem definition. 3. Satisfying performance on three different domain adaptation tasks.

Weaknesses: 1. The authors highlight that “HADA could achieve lower error bound” several times in both the main paper and the supplementary material, but I am very confused by this statement. What is its intuitive meaning? In addition, how is this property related to the superiority of HADA? Is there any comparison? 2. Experiment-wise, there are not enough ablation studies to support the superiority of using heuristic search theory. Besides, this paper uses multiple loss terms. It would be helpful to know which one contributes more. 3. Typo. n-th at Line-104.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I'm curious what the performance would be if we discarded the heuristic search and incorporated some commonly used disentanglement techniques to separate the different properties. -------------------------------------------------------------------------------------- The authors addressed my concerns during the rebuttal period.