Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper addresses an important problem in a unique way. Self-supervision is a promising avenue of research, but it currently relies on significant domain knowledge. The authors propose to overcome this with model-agnostic meta-learning. The experiments are fairly comprehensive, and the exposition is clear and well-cited.

Despite the originality of the work, the experiments do not currently make a very strong case for its significance. The comparisons in Table 1 show consistent but modest improvements in accuracy. While the improvements are greater than the run-to-run variation with different random seeds, it is unclear how such an improvement compares to the variation in performance obtained by modifying standard data augmentation and/or regularization. This makes it less clear how general the results would be.

Similarly, the results on CIFAR-100 are suggestive but not very convincing. MAXL clearly performs significantly better than single-task training and the random baseline, but the relative advantage over the k-means baseline appears to be at most about 1%. It is difficult to tell from the graph presentation, and it is not clear what additional information the time series in the graph provides.

Finally, it is nice that visualizations are presented, but the analysis of the visualizations is somewhat lacking. CNNs are notorious for modeling less salient features such as texture and global illuminance. Further quantitative or qualitative studies, such as saliency/gradient visualization of the learned features, could help illustrate the common characteristics.
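To make the concern above concrete, here is a minimal sketch (with invented, purely illustrative per-seed accuracies, not numbers from the paper) of how a reported gain can be weighed against seed-to-seed variation:

```python
import statistics

# hypothetical per-seed test accuracies for the baseline and MAXL
# (illustrative numbers only; the paper's actual values differ)
baseline = [74.1, 74.5, 73.9, 74.3, 74.2]
maxl = [75.0, 75.2, 74.8, 75.1, 74.9]

gain = statistics.mean(maxl) - statistics.mean(baseline)
seed_sd = statistics.stdev(baseline)

# a gain is only persuasive if it clearly exceeds seed-to-seed noise;
# the same check should be repeated against augmentation/regularization tweaks
print(f"gain={gain:.2f}, seed std={seed_sd:.2f}, ratio={gain / seed_sd:.1f}")
```

The same ratio computed against the spread induced by varying augmentation or regularization would address the generality question directly.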
Main Ideas:
The high-level motivation is to combine the strengths of supervised and unsupervised methods for auxiliary task learning. The authors present a meta-learning algorithm that automatically determines labels of auxiliary tasks without manual labels. They study this method in the context of classification.

Relation to Prior Work:
This is a straightforward application of gradient-based meta-learning. Tailoring the learning of the label generator to the learning progress of the multi-task learner is an elegant formulation of an iterative optimization procedure.

Quality:
- strengths: the authors conducted a thorough analysis comparing MAXL with several baselines.
- weaknesses: It would strengthen the paper to show an experiment analyzing the weighting coefficient lambda on the entropy term. The authors state the collapsing-class problem, but do not show an experiment highlighting why the problem is important. The number of auxiliary classes per primary class appears to be a hyperparameter; it would be informative if the authors could provide an analysis of how to choose it, since according to Figure 3 this choice has a non-trivial effect on generalization performance.

Clarity:
- strengths: the paper is very well written and motivated.

Originality:
- strengths: the proposed method appears to be novel.

Significance:
- strengths: MAXL can in principle be applied to any classification task as long as the number (but not the identity) of auxiliary tasks is pre-defined.
- weaknesses: While MAXL provides an improvement over single-task learning as shown in Table 1, the improvement seems marginal. It would be informative for the authors to discuss why MAXL could not improve generalization performance by more than one percentage point on any of the classification tasks.
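To illustrate the collapsing-class concern: one way to see why an entropy term matters is to penalize low entropy in the batch-averaged auxiliary-label distribution. The sketch below is a toy version with an illustrative weighting `lam`; the names and exact form are assumptions, not the paper's formulation.

```python
import math

def collapse_penalty(batch_probs, lam=0.2):
    """Toy entropy penalty on predicted auxiliary-label distributions.

    batch_probs: list of per-example probability vectors over k auxiliary classes.
    Returns lam * (log k - H(avg)), which is 0 for uniform class usage and
    grows as the generator collapses onto fewer auxiliary classes.
    """
    k = len(batch_probs[0])
    # average predicted distribution over the batch
    avg = [sum(p[j] for p in batch_probs) / len(batch_probs) for j in range(k)]
    # entropy of the averaged distribution; low entropy => collapsing classes
    ent = -sum(p * math.log(p) for p in avg if p > 0)
    return lam * (math.log(k) - ent)
```

With this penalty, a generator that assigns every example to one auxiliary class pays the maximum cost (lam * log k), while uniform usage of the k classes pays nothing; an ablation over lam would show how much this behavior matters in practice.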
Though I think this paper proposes a very interesting approach to automating the design of auxiliary tasks, I am disappointed by its practical value on the image classification tasks evaluated. According to Table 1, the method outperformed the standard single-task learning baseline by a very small margin (less than 1%) on all seven datasets. Why didn't we see larger performance gains using the proposed approach? I'd hope to hear the authors' hypothesis. Also, with what kinds of datasets/models is the proposed method for generating auxiliary tasks most effective?

I am also curious about the scalability of this approach with more advanced architectures and larger datasets. In the experiments, larger images are rescaled to a low [32x32] resolution. Could the authors comment on the reason for this design choice? Are there any technical constraints that limit the model to training with small images? Would this model be effective in more realistic image classification setups, say training a state-of-the-art ResNet-101 for the ImageNet challenge?

Another baseline that I'd like to see is to use variational autoencoders (e.g. beta-VAE) to learn latent discrete representations from the data and use the learned discrete codes as labels for auxiliary learning. This could possibly be a stronger baseline than the k-means baseline.

Fig. 4: what is the intuition for why the cosine similarity between the gradients of the primary task and those of the auxiliary task is always positive? Would it be possible for the auxiliary tasks to generate gradients in the opposite direction to the gradient of the primary task, so as to pull it out of a local minimum?

Fig. 2: is there any parameter sharing between the multi-task network and the label-generation network?

I appreciate the provided source code and the negative results reported in the supplementary materials.
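For reference, the quantity asked about in Fig. 4 is just the cosine of the angle between the two flattened gradient vectors (a hypothetical helper, not the authors' code); a negative value would correspond to the opposing-direction scenario described above.

```python
import math

def cosine_similarity(g_primary, g_auxiliary):
    """Cosine between two flattened gradient vectors.

    +1 means the auxiliary gradient points in the same direction as the
    primary one, 0 means orthogonal, and a negative value means the
    auxiliary task is pushing the shared weights against the primary task.
    """
    dot = sum(a * b for a, b in zip(g_primary, g_auxiliary))
    n1 = math.sqrt(sum(a * a for a in g_primary))
    n2 = math.sqrt(sum(b * b for b in g_auxiliary))
    return dot / (n1 * n2)
```

If the measured similarity never goes negative, the auxiliary gradients never directly oppose the primary ones, which is why the question about escaping local minima via opposing gradients is worth asking.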