NeurIPS 2020

FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence


Review 1

Summary and Contributions: The paper proposes a rather simple but efficient algorithm for semi-supervised learning. The algorithm is based on the previously proposed teacher-student architecture. The novelty is 1) that the teacher always receives weakly augmented samples (flip and shift) while the student receives strongly augmented samples (RandAugment and CTAugment, proposed previously); 2) that instead of computing the loss for all unlabeled examples, the loss is computed only for unlabeled examples on which the teacher is confident, and the target for the student is a one-hot encoded label instead of a distribution (as was done previously).
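To make the summarized procedure concrete, here is a minimal sketch of the confidence-thresholded unlabeled loss as I understand it (PyTorch-style; the function names, the weak_aug/strong_aug callables, and the threshold value are my own placeholders, not the authors' code):

import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, unlabeled_images, weak_aug, strong_aug, threshold=0.95):
    # "Teacher" pass: predict on weakly augmented images (flip and shift);
    # the pseudo-labels are not backpropagated through.
    with torch.no_grad():
        weak_logits = model(weak_aug(unlabeled_images))
        probs = torch.softmax(weak_logits, dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)   # hard (one-hot) targets
        mask = (confidence >= threshold).float()        # keep only confident examples

    # "Student" pass: predict on strongly augmented images (e.g. RandAugment / CTAugment).
    strong_logits = model(strong_aug(unlabeled_images))

    # Cross-entropy against the hard pseudo-labels, masked by confidence and
    # averaged over the unlabeled batch; no ramp-up of the loss weight is needed.
    per_example_ce = F.cross_entropy(strong_logits, pseudo_labels, reduction='none')
    return (per_example_ce * mask).mean()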

Strengths:
- The proposed algorithm is simple and efficient, and the experimental results are good.
- It is very nice that one does not have to use ramp-ups for the weight of the consistency cost; the proposed solution is more elegant.
- I liked the ablation study on optimizers (Table 7 in the appendix). That study shows that SSL performance is quite sensitive to the optimizer's hyperparameters.

Weaknesses:
- The results are similar to the previous state of the art, and the proposed method seems like a small modification of existing algorithms.
Below are some points that apply to many SSL papers published recently:
- It is unclear whether the proposed algorithm can be extended to other domains (beyond image classification).
- I wonder how much of the success of this and other similar SSL methods is due to our knowledge of the domain (image classification) that comes from training on large labeled data sets. Specifically, we know which architectural choices work and do not work on particular datasets when training with many labels. One indicator of this is the use of different hyperparameters for the smaller datasets and for ImageNet in the paper.
- Is the scenario considered in the paper realistic for many practical applications? I think that in most applications, the unlabeled samples do not all come from the same classes as the labeled data.

Correctness: The methodology is consistent with recently published SSL papers. However, the ablation study on optimizers suggests that the choice of optimizer has a large impact on the results. In light of this result, it seems that this paper (and most other SSL papers) overfits to the (large labeled) test set. Is that the case?

Clarity: The paper is well written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback:
- Equation (1) is difficult to understand because the two terms look exactly the same (see the note below).
- I would like to see more details on the loss function used in the experiment reported in Fig. 3b.
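For reference, my reading of Eq. (1), reconstructed here from memory and possibly with slightly different notation, is the generic consistency-regularization loss

\sum_{b=1}^{\mu B} \left\| p_m\big(y \mid \alpha(u_b); \theta\big) - p_m\big(y \mid \alpha(u_b); \theta\big) \right\|_2^2 ,

where the two terms are written identically and differ only through the stochasticity of the weak augmentation \alpha (and of the model p_m itself). Spelling this out in the text would resolve the confusion.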


Review 2

Summary and Contributions: The work combines pseudo-labeling and consistency regularization to propose a simple, high-performing method for semi-supervised learning. The main idea is to use weakly-augmented images to produce pseudo-labels, and then train the model on strongly-augmented images with these labels.

Strengths: The paper presents a simplified version of earlier works, such as UDA and ReMixMatch, while achieving performance on par with ReMixMatch. Additionally, some experiments applying the method to a few-shot learning task are presented. An extensive ablation study (in the supplementary material) is provided.

Weaknesses:
1. I cite from a ReMixMatch figure caption: "Augmentation anchoring. We use the prediction for a weakly augmented image (green, middle) as the target for predictions on strong augmentations of the same image". This sounds to me like a summary of the presented work, and as such I consider it a special case of ReMixMatch. The authors have discussed the differences between their work and ReMixMatch, mentioning that (1) "ReMixMatch don't use pseudo labeling", and (2) ReMixMatch uses sharpening of pseudo-labels and weight annealing of the unlabeled-data loss. However, in Section 3.2.1 of ReMixMatch, it is stated that the guessed labels are used as targets (for strongly augmented images) with a cross-entropy loss. I believe this is called self-training with pseudo-labeling, just as this work proposes.
2. It is stated (lines 213-215) that FixMatch substantially outperforms MixMatch, ReMixMatch, and UDA with 40 and 250 labels, but this is incorrect. The performance of ReMixMatch is very close to that of FixMatch in this regime (and ReMixMatch outperforms FixMatch with more data).
3. The "Barely supervised learning" section describes 1-shot experiments in a setting that approximates the standard few-shot training/test regime (i.e., with episodes). The authors are encouraged to align with standard few-shot protocols and to compare their performance to other methods in the data-starved regime (e.g., ReMixMatch).

Correctness: The presented expressions appear to be correct.

Clarity: The paper is clearly written with detailed explanations.

Relation to Prior Work: The discussion of prior art is rich, as it is an important part of this paper.

Reproducibility: Yes

Additional Feedback: The presented work has clear practical value, as it distills the simple and powerful techniques in SSL that deliver SOTA results. However, I find the novelty limited, since these techniques were presented in earlier frameworks, and it is not clear in what way those earlier frameworks are more complicated or difficult to manage.


Review 3

Summary and Contributions: They propose a new approach for semi-supervised learning (SSL) that assigns a pseudo-label to a strongly-augmented image using the prediction for a weakly-augmented version of the same image. Despite the simplicity of their method, they achieve very strong results in standard SSL benchmark settings.

Strengths:
1. The proposed method can be a good direction for semi-supervised learning. Although there are several SSL methods that effectively use data augmentation, such as mix-up, the proposed approach seems to differ from previous works. FixMatch is a simple yet effective SSL method.
2. The empirical evaluation is very carefully designed and sufficiently shows the effectiveness of the approach.
3. The analyses of the augmentation strategy and of sharpening also provide good insights.

Weaknesses:
1. The paper does not provide a good explanation of why guiding predictions for strongly-augmented images with predictions for weakly-augmented images works so well. Although this issue is common to other works, it would be good if they provided an empirical or theoretical analysis on this point.
2. Is it always easy to define "weak" and "strong" augmentation? The two kinds of augmentation are defined in a heuristic way. Maybe for some datasets the defined "weak" augmentation is actually a "strong" augmentation? I cannot come up with a good example, but the appropriate data augmentation can differ from dataset to dataset.

Correctness: They are correct.

Clarity: The paper is very well written. It is very easy to follow.

Relation to Prior Work: They provide clear distinction from other works including pre-print works. Their contribution is clearly stated.

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: This paper proposes a method called FixMatch for SSL. It treats the argmax of the predictions on weakly-augmented images as pseudo-labels, and minimizes the loss between these pseudo-labels and the predictions on strongly-augmented images. The method achieves promising results on several image recognition datasets.

Strengths:
1. The paper is well written and easy to follow.
2. The performance on several datasets is promising.

Weaknesses:
1. The novelty of the proposed method is incremental; both hard pseudo-labels and consistency ideas were proposed in the prior literature.
2. Why does hard pseudo-labeling perform better than soft sharpening (see the sketch below)? Soft labels are widely used in image classification tasks (in both supervised and SSL settings) to improve the model's generalization capacity. When there are only a few labels, there is a high chance that the predictions on unlabeled samples are incorrect, and these wrong predictions may introduce more label noise than soft labels would.
3. The performance of the proposed method on CIFAR-100 is inferior to ReMixMatch. The authors demonstrate that the model can achieve the best performance with distribution alignment. Why does the result vary so heavily? Does this happen on ImageNet too? Is it related to the number of classes?

Minor Comments:
1. It would be better if the results on ImageNet were compared in a table.
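To make the comparison in point 2 concrete, here is a toy sketch of the two choices of target (hard argmax pseudo-labels versus temperature-sharpened soft targets); the function names and tensors are my own illustration, not taken from the paper or its code:

import torch
import torch.nn.functional as F

def hard_label_loss(strong_logits, weak_probs):
    # Hard pseudo-labels: argmax of the weak-augmentation predictions (as in FixMatch).
    pseudo_labels = weak_probs.argmax(dim=-1)
    return F.cross_entropy(strong_logits, pseudo_labels)

def soft_label_loss(strong_logits, weak_probs, T=0.5):
    # Soft targets: temperature-sharpened weak-augmentation predictions
    # (the sharpening used by MixMatch/ReMixMatch-style methods).
    sharpened = weak_probs ** (1.0 / T)
    sharpened = sharpened / sharpened.sum(dim=-1, keepdim=True)
    log_probs = F.log_softmax(strong_logits, dim=-1)
    return -(sharpened * log_probs).sum(dim=-1).mean()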

Correctness: Yes/Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: ***Rebuttal Response*** Thanks for the authors' response. As all my concerns are addressed in the feedback, I will raise my score to accept the paper.