Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
The paper proposes a natural combination of two methods from FSL and SSL (namely the (strong) MTL and (basic) self-training, respectively) to address the problem of learning a classifier from few labeled and many unlabeled examples. However, a trivial composition of these methods brings almost no gain from the unlabeled samples, so an effort is made to make the self-training more robust to noise by involving an additional few-shot method (namely the Relation Network, which is essentially a Siamese network) and by fine-tuning on just the labeled examples (which is enabled by the initial MAML training). The experimental results corroborate the effectiveness of this method, so overall the paper shapes and delivers useful practical knowledge.

In light of the presented results, I wonder whether the proposed soft weighting of pseudo-labels could also be used in the vanilla SSL task, composed with any method based on label propagation. Perhaps the authors have some results or thoughts in this direction. One issue that bothers me in the performance results of Table 2 is the low accuracy reported for the baseline methods  and . These results are lower than the concurrent performance of methods that use only the few labeled examples, without the unlabeled ones. The presented ablation study is satisfying, since the performance of the different versions of the algorithm's blocks helps to understand their importance. The performance reduction due to distracting classes, demonstrated in Table 2, is a good additional analysis.
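The soft weighting the review asks about is, in principle, portable to any SSL pipeline: each pseudo-labeled sample contributes to the loss in proportion to a confidence score. A minimal sketch (the function name, shapes, and score source are my assumptions, not the paper's code):

```python
import numpy as np

def soft_weighted_pseudo_loss(probs, pseudo_labels, weights):
    """Cross-entropy over pseudo-labeled samples, with each sample
    down-weighted by a per-sample confidence score (e.g. an SWN-style
    relation score, or a label-propagation confidence in vanilla SSL).

    probs:         (n, num_classes) predicted class probabilities
    pseudo_labels: (n,) integer pseudo-labels
    weights:       (n,) non-negative per-sample confidence scores
    """
    n = len(pseudo_labels)
    # Negative log-likelihood of each sample's pseudo-label.
    ce = -np.log(probs[np.arange(n), pseudo_labels] + 1e-12)
    # Confidence-weighted average: low-confidence samples barely count.
    return float(np.sum(weights * ce) / (np.sum(weights) + 1e-12))
```

Setting a sample's weight near zero effectively removes it from the loss, which is what makes the scheme a soft alternative to hard filtering.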
In this paper, the authors propose a novel semi-supervised meta-learning method called learning to self-train (LST) that leverages unlabeled data and meta-learns how to cherry-pick unlabeled samples to further improve performance. I think the most prominent part of the paper is the self-training method, which combines the SWN (similar to the Relation Network) with the meta-learning training strategy to learn the relation between each unlabeled sample and the prototypes of each category, thus reducing the noise of uncertain pseudo-labels.

The following aspects of the paper are insufficient or unclear:
1. The MTL (tieredImageNet) results presented in the upper part of Table 1 are marked "by us". What does that mean?
2. It might be unfair to compare with few-shot learning methods in Table 1. These methods neither use the unlabeled set R nor iterate 40 times to fine-tune during test. I therefore think the statement in LINES 240-241 is not appropriate: "our LST performs best in both 1-shot (77.7%) and 5-shot (85.2%) and surpasses the state-of-the-art method  by 11.7% and 3.6% respectively for 1-shot and 5-shot."
3. Could most few-shot learning methods be equipped with MTL to improve performance on the semi-supervised few-shot classification task?
4. The results "mini w/D" and "tiered w/D" in Table 2 are not very convincing. Do these experiments show that your method is unstable on semi-supervised few-shot learning, so that the best module selections vary greatly across datasets? There is little explanation of the usage scenarios of "+recursive" and "+mixing", and no theoretical support for the explanation in LINES 259-261.
5. I think the method might not work well in reducing the effect of noisy labels (LINE 137) when using a lot of unlabeled data from distracting classes that are excluded from the support set, and only few samples from classes included in the support set.
What is the role of unlabeled data from distracting classes in the loss function of Formula 5 (after LINE 157)? Is the "hard selection" strategy the only mechanism that reduces the impact of "false samples" from distracting classes? In summary, I think the idea of the paper is somewhat interesting but contributes little, because the scenario of few-shot learning problems is quite different from the experimental setup in the paper. The MTL method in the paper requires a large number of unlabeled samples and many iterations to adjust the parameters during test. For semi-supervised few-shot learning, I think a method needs to focus on the situation where there is a large difference between the unlabeled sample categories and the support-set categories, but the paper lacks the innovation to solve this problem.
1. In the proposed training scheme, training on the data points with pseudo-labels is followed by fine-tuning the model only on the labeled data. What would the model's performance look like if this fine-tuning on the labeled data were not used?
2. For the experimental results in Table 1, it seems only ResNet-12 (pre) is used for the proposed method. What about other backbones, e.g., 4 CONV, which is used in the previous literature? Also, the comparison does not seem to include more recent approaches, e.g., RelationNet, dynamic few-shot visual learning without forgetting, etc. It would be nice to see a more extensive comparison with previous approaches.
3. The cherry-picking step is composed of a hard-selection and a soft-weighting step. What are the detailed statistics on how many unlabeled data points are filtered in each step? Also, what is the ratio of unlabeled to labeled data used during training?

--- After rebuttal: The feedback is satisfactory. I increase my score to 6.
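The two-phase schedule that question 1 above refers to (train on pseudo-labeled data, then fine-tune on the labeled examples only) can be sketched abstractly; the function names and the toy least-squares gradient are my own illustration, not the authors' code:

```python
import numpy as np

def two_phase_train(w, grad_fn, pseudo_data, labeled_data,
                    steps_pl, steps_ft, lr=0.5):
    """Gradient descent first on pseudo-labeled data, then fine-tuned
    on the (few) labeled examples only. Dropping the second phase
    (steps_ft=0) is the ablation question 1 asks about."""
    for _ in range(steps_pl):           # phase 1: pseudo-labeled data
        w = w - lr * grad_fn(w, *pseudo_data)
    for _ in range(steps_ft):           # phase 2: labeled data only
        w = w - lr * grad_fn(w, *labeled_data)
    return w

def sq_grad(w, X, y):
    """Gradient of mean squared error for a linear model (toy stand-in
    for the real classifier's gradient)."""
    return 2.0 * X.T @ (X @ w - y) / len(y)
```

With noisy pseudo-labels, phase 1 pulls the weights toward the (partly wrong) pseudo-label solution, and phase 2 pulls them back toward the labeled-data solution; skipping phase 2 leaves the model wherever the noisy targets put it.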