Paper ID: | 1107 |
---|---|

Title: | Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting |

The paper presents a novel meta-learning method, together with a detailed algorithm and an analysis of its convergence properties. The proposed method adaptively learns an explicit weighting function directly from data. The weighting function is an MLP with one hidden layer, and the method can fit a wide range of weighting functions. The paper discusses a large body of related work and analyzes the pros and cons of each approach. Experimental results show that the proposed method outperforms the state-of-the-art methods. The paper is well organized, and the figures and tables are clear. However, there are some typographical errors. For example, at line 180, it should read “stable” rather than “stale”. At line 69, it should read "tradition" instead of "traditional".

This paper proposes to reweight samples using a simple MLP with one hidden layer, trained in a meta-learning manner. The proposed method is justified both theoretically and empirically. Theoretically, the convergence of the proposed method is proved. Empirically, the method is evaluated on both class imbalance and noisy label problems. Learning sample reweighting from data is not new. As noted in the related work section, there are other methods in this line, e.g. MentorNet and Learning to Reweight (L2RW). MentorNet also learns to reweight samples from data with the aid of a neural network (an LSTM). Can you discuss in more detail how the choice of explicit reweighting function influences the results? In the noisy label experiments, the classifiers are trained for only 40 epochs for uniform noise and 60 epochs for flip noise. My concern is that not all methods can converge in so few epochs. It would be clearer if the full training curves of the different methods were compared over more training epochs, e.g. 200 epochs. Without seeing the full curves, we cannot conclude that the proposed method converges faster than the other methods (as claimed in line 101 of the supplementary material).

This paper studies the problem of learning from biased training data (i.e. distribution shift) with the help of a small set of unbiased meta-data. This notably covers the cases of class imbalance and noisy labels. The proposed meta-weight-net is an MLP with one hidden layer that learns a mapping from the training loss of a sample to its weight. Minimizing the training objectives naturally leads us to focus more on samples that agree with the meta-knowledge. Theoretically, it is shown that the algorithm converges to critical points of the loss under classical assumptions (but I am quite confused by the proof, see below). The experimental results are very promising.

- Pros: The problem studied by this paper is important, and the idea of learning a meta-weight-net is interesting and reasonable. On the one hand, we get rid of the burden of weight function design and hyperparameter tuning compared with ad hoc sample reweighting strategies. Although this comes at the cost of requiring some meta-data, such a requirement should be feasible in many cases. On the other hand, compared with other weight learning methods, here one only learns a mapping from training loss to sample weight. The procedure is thus simpler, but it should be sufficient if we suppose that there is some regularity in the optimal weight that needs to be assigned to each sample and that this weight is related to the training loss of the sample. The experimental part is very complete. In both the class imbalance and noisy label settings, the authors compare with a number of baseline methods and show that meta-weight-net learning performs best most of the time. Different datasets (though all of them are image datasets) and model structures are considered, strengthening the above claim. The experiments are also conducted with different levels of class imbalance and label noise.
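To make the summarized mechanism concrete, here is a minimal sketch of an MLP with one hidden layer that maps a sample's training loss to its weight. The hidden width of 100, the ReLU activation, the sigmoid output, the random initialization, and all function names are assumptions made for illustration and are not taken from this review. The sketch covers only the forward pass of the weighting function, not the meta-update that trains it on the meta-data.

```python
import numpy as np

def init_params(hidden=100, seed=0):
    # Hypothetical initialization; the width and scale are assumptions.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(1, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, size=(hidden, 1)),
        "b2": np.zeros(1),
    }

def sample_weights(losses, params):
    # Input: one scalar loss per sample, shaped (n, 1) for the matrix product.
    x = np.asarray(losses, dtype=float).reshape(-1, 1)
    # Single hidden layer with ReLU (assumed activation).
    h = np.maximum(x @ params["W1"] + params["b1"], 0.0)
    z = h @ params["W2"] + params["b2"]
    # Sigmoid output keeps every weight in (0, 1) (assumed output range).
    return (1.0 / (1.0 + np.exp(-z))).ravel()

def weighted_loss(losses, params):
    # Training objective under reweighting: mean of weight * loss.
    losses = np.asarray(losses, dtype=float)
    return float(np.mean(sample_weights(losses, params) * losses))

params = init_params()
losses = np.array([0.05, 0.7, 2.3, 6.0])
weights = sample_weights(losses, params)
total = weighted_loss(losses, params)
```

Because the input is a single scalar, the learned function can be plotted directly as weight against loss, which is one way to inspect whether it emphasizes high-loss or low-loss samples.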
- Questions/issues: My biggest concern is the correctness of the theorems and proofs, though they constitute a less important part of this paper. In fact, I cannot understand how \mathcal{L}^{meta}(\Theta) is defined in Theorem 1 and in the appendix. If we use (2) of the main paper as the definition, it seems that we need some way to compute w*. Incidentally, the notation suggests rather that \mathcal{L}^{meta} is a function of w, and the same holds for (4) of the supplementary material. On the other hand, if we use line 33 of the supplementary material, which seems to be the case in the proof, I do not understand how it is possible that \mathcal{L}^{meta} does not depend on w. Meanwhile, if \mathcal{L}^{meta} does depend on w, the proof of Theorem 1 no longer holds. For example, in the inequality after line 47 (supp), \Theta^{(t+1)} and \Theta^{(t)} are not evaluated with the same w, so I doubt whether we can write an inequality like this. In Lemma 1, it seems to be proved that the gradient of \mathcal{L}^{meta} is bounded (with the definition given in line 33 of the supplementary material). Nonetheless, to prove that \mathcal{L}^{meta} is Lipschitz smooth we need to prove that the Jacobian of the gradient operator is bounded. Finally, in the proof of Theorem 2, line 71, the authors claim \sum... < \infty. I cannot see how we can draw such a conclusion, since we do not really have a telescopic sum in (21) (with two \Theta^{t+1} on the left of the inequality). To conclude, I feel the proof should be much more involved to establish some kind of convergence result, and I have doubts about the current proofs of the two theorems.
- Minor points:
1. In the mathematical formula after line 75 in the supplementary material, shouldn't there be an l2 norm in the sixth line and a different bound for \mathcal{L}^{meta} instead of \rho at the end? There seem to be some other typos, but I would like to first understand the points mentioned above, so I do not list them here.
2. I do not like the fact that the paper emphasizes that a neural network with one hidden layer is a universal approximator. This is just a theoretical result and gives little insight into the true capacity of a neural network with a fixed number of neurons. I think the important point here is that the optimal weighting function may be simple enough to be approximated by an MLP with a single hidden layer.

----

Initially I leaned towards voting to accept the paper, since the proposed algorithm is sensible and the experimental part is strong. However, I may have misunderstood the theorems and their proofs. As I would like to avoid accepting papers with wrong theoretical statements (even though these might not be that important for this paper), I cannot vote for acceptance for the time being.

==============================================================

After the rebuttal

I would like to thank the authors for their detailed reply. The authors' rebuttal has clarified my concern about the definition of \mathcal{L}^{meta}. However, I still have some doubts about how the telescopic sums are formulated in the proofs of Theorems 1 and 2. For example, in the equation following line 26 of the rebuttal, there are terms in w^{(t+1)},\Theta^{(t+1)} and w^{(t)},\Theta^{(t+1)} which will not disappear after taking the sum. The same question arises for the equation following line 42. I hope these questions can be resolved in the revision.
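
To illustrate the telescoping concern, consider a generic function F of (w, \Theta). F is a placeholder introduced here for illustration and is not the exact expression from the paper or the rebuttal. The sketch contrasts a sum that telescopes with one over mixed iterates that does not, and shows the extra difference term that would have to be controlled.

```latex
% F(w, \Theta) is a hypothetical placeholder, not the paper's exact quantity.
\begin{align*}
% Telescoping: consecutive terms share both arguments, so interior terms cancel.
\sum_{t=1}^{T}\Big[F\big(w^{(t+1)},\Theta^{(t+1)}\big)-F\big(w^{(t)},\Theta^{(t)}\big)\Big]
  &= F\big(w^{(T+1)},\Theta^{(T+1)}\big)-F\big(w^{(1)},\Theta^{(1)}\big). \\[4pt]
% Mixed iterates: step t contributes +F(w^{(t+1)},\Theta^{(t+1)}) while step t+1
% contributes -F(w^{(t+1)},\Theta^{(t+2)}); these differ in \Theta and do not cancel in general.
\sum_{t=1}^{T}\Big[F\big(w^{(t+1)},\Theta^{(t+1)}\big)-F\big(w^{(t)},\Theta^{(t+1)}\big)\Big]
  &\neq F\big(w^{(T+1)},\Theta^{(T+1)}\big)-F\big(w^{(1)},\Theta^{(1)}\big)
  \quad\text{in general.} \\[4pt]
% A full step splits into a w-update part and a \Theta-update part; both must be bounded.
F\big(w^{(t+1)},\Theta^{(t+1)}\big)-F\big(w^{(t)},\Theta^{(t)}\big)
  &= \Big[F\big(w^{(t+1)},\Theta^{(t+1)}\big)-F\big(w^{(t)},\Theta^{(t+1)}\big)\Big]
   + \Big[F\big(w^{(t)},\Theta^{(t+1)}\big)-F\big(w^{(t)},\Theta^{(t)}\big)\Big].
\end{align*}
```

Under this reading, a bound on the first bracket alone would not be enough to conclude that the series is finite; the second bracket would need its own bound.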