NeurIPS 2020

MESA: Boost Ensemble Imbalanced Learning with MEta-SAmpler

Review 1

Summary and Contributions: The paper introduces a so-called meta-sampler that adaptively undersamples imbalanced data using an iterative approach. The authors compare their work to others in the field, and the results indicate that their method is promising.

Strengths: This is an important topic of research.

Weaknesses: The algorithm focuses on undersampling, rather than oversampling. This is worrisome, since the comparative study is with SMOTE-like algorithms. Thus, one would expect to also see some comparison with random oversampling, etc. and a deeper explanation about this design choice. The paper is difficult to follow and needs to be edited prior to publication.

Correctness: In equation 3, the authors decide to use the Gaussian function for determining the sampling weight. This choice needs some justification, and the fact that these weights are unnormalised also needs to be addressed. Further, the choice of the "soft actor-critic" approach is not motivated. The results in Tables 2 and 3 come as a surprise. Notably, the poor performance of the SMOTE-based methods seems suspect and needs a careful explanation.
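To make the point about unnormalised weights concrete, the weighting this review questions can be sketched as follows (a minimal sketch; the function name, `sigma`, and the exact form are assumptions based on the review's description, not the paper's notation):

```python
import numpy as np

def gaussian_sampling_weight(errors, mu, sigma=0.2):
    """Unnormalised Gaussian weighting: instances whose classification
    error lies close to mu receive the largest sampling weight."""
    w = np.exp(-((errors - mu) ** 2) / (2 * sigma ** 2))
    # As raised above, these weights do not sum to 1; they would need
    # to be normalised before serving as sampling probabilities.
    return w
```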

Clarity: The clarity could be improved, notably when explaining the design decisions and the experimental results. The paper contains numerous typographical errors; for instance, the authors refer to "Native Bayes" (presumably "Naive Bayes"). Also, the ordering of the references is incorrect, e.g. [32, 25, 6] instead of [6, 25, 32].

Relation to Prior Work: The authors missed many related works, notably in the cost-sensitive learning domain.

Reproducibility: No

Additional Feedback:

Review 2

Summary and Contributions: Imbalanced learning has been a hot topic in machine learning since it can be easily encountered in most real-world applications, where data are not only scarce but also heavily biased toward one class. Mesa, the proposed framework, aims at tackling this learning problem through a novel approach combining meta-learning, resampling techniques, ensemble learning and reinforcement learning. While theoretical justifications are meagre, the paper is rich in technical details on how the 4 aforementioned frameworks are combined, as well as empirical results demonstrating the effectiveness of Mesa.

Strengths: The main strength of the paper resides in combining 4 established learning frameworks into a new (and novel) learning approach. The resulting algorithm is made up of three separate procedures: resampling, ensemble learning and meta learning. In my opinion, the most important contribution of this paper is the cross-task transferability of the meta-sampler, which opens up several interesting research directions, e.g. applying a pre-trained meta-sampler to small datasets (where classical IL methods usually fail to train pertinent models) and large datasets (scalability). The empirical section is also a strong contribution as it is particularly rich in results covering various facets of Mesa.

Weaknesses: I have only a few remarks on this paper, even though they shouldn't be considered as weaknesses. They are listed below in no particular order.
- In eq. 1, | is used both as the absolute-value operator and the cardinality operator, which can lead to confusion.
- In eq. 2, \tau and v have not been previously defined (unless I'm missing something).
- I find it regrettable that no theoretical analysis of Mesa (e.g. convergence speed, generalization error, etc.) is proposed aside from the complexity one, especially since it is built upon frameworks with strong theoretical properties.
- Line 155: "is thus can be" is a typo.
- Line 173: reference error, "Haarnoja et al."
- Line 233: "compares" should be "compared".
- Table 2: what does k correspond to? Is it the parameter of Algorithm 2?
- A few more datasets would've been appreciated, especially concerning the cross-task transferability.

Correctness: There is little to no theoretical analysis in this paper, as it is clearly a more empirical one. The experimental protocol is clearly explained and defined.

Clarity: The paper is very well written and easy to follow. Quite a pleasant read.

Relation to Prior Work: The authors propose an extensive review of related work and do a nice job at stating the importance of the contribution w.r.t. existing works.

Reproducibility: Yes

Additional Feedback: after rebuttal: I find the answers provided in the rebuttal satisfying, as such I'm keeping my original rating for this paper.

Review 3

Summary and Contributions: This paper addresses the binary imbalanced classification problem, in which negative samples greatly outnumber positive samples. The main idea is to learn an ensemble of classifiers, each of which is trained on instances sampled by a meta-sampler. The meta-sampler takes as input the histograms of the error distributions on the training and validation sets, and outputs a scalar value \mu, from which the sampling weights are calculated. Instances with errors closer to \mu have higher weights for sampling. For mapping the histogram error distributions to the value \mu, the authors propose to learn a policy network.
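For concreteness, the sampling step summarized above might look like the following sketch (the names, the label convention, and the undersampling details are assumptions; the paper's actual procedure may differ):

```python
import numpy as np

def sample_balanced_subset(maj_X, maj_errors, min_X, mu, sigma=0.2, rng=None):
    """Undersample the majority class with weights from a Gaussian
    centered at mu, yielding a subset balanced against the minority."""
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.exp(-((maj_errors - mu) ** 2) / (2 * sigma ** 2))
    p = w / w.sum()  # normalize into a sampling distribution
    idx = rng.choice(len(maj_X), size=len(min_X), replace=False, p=p)
    # Label convention (an assumption): majority = 0, minority = 1.
    X = np.vstack([maj_X[idx], min_X])
    y = np.concatenate([np.zeros(len(min_X)), np.ones(len(min_X))])
    return X, y
```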

Strengths: The designed meta-sampler can adaptively sample instances to construct a training dataset balanced between positive and negative classes, from which a member of the classifier ensemble is learned. Therefore, the proposed approach performs well on the imbalanced classification problem, even with noisy and corrupted labels. Overall, it is an interesting work. The meta-sampler can effectively address the class imbalance problem.

Weaknesses: There are some errors in Algorithm 1, which may be typos. These errors are given in detailed comments later.

Correctness: The proposed solution has been compared with several groups of baselines. The evaluation results show the superior performance of the proposed solution. In addition, extensive results in the supplementary document further confirm the usefulness of the proposed meta-sampler.

Clarity: The paper is well written.

Relation to Prior Work: The related work is clearly discussed, and the contribution is appropriately highlighted.

Reproducibility: Yes

Additional Feedback: In Algorithm 1, line 1, P_t is the majority set and N_t is the minority set. However, in line 3, why is the majority subset N'_t sampled from N_t? In each iteration, the training subset is sampled from the original training set. That is to say, for each training instance in the original training set, is a classification error evaluated by the current ensemble model F_t? However, in line 2 of Algorithm 1, x_i is classified by F. Also, errors like "A instance is" (which should read "An instance is") should be corrected. The authors' feedback addressed my comments.

Review 4

Summary and Contributions: This paper deals with supervised learning (classification) under class imbalance. The authors propose to address this issue by using a meta/ensemble-learning framework. In this framework, the meta-algorithm deduces an appropriate data sampling strategy that generates a data set for a new base learner to train on. The meta-learner is trained using reinforcement learning. The meta-state is composed of two histograms that are respectively the empirical distributions of the training and validation errors. The meta-sampler uses this state to sample a coefficient. This coefficient is used as the mean of a Gaussian from which sampling weights are obtained. These weights are used to obtain a balanced dataset on which the new base learner is trained. The aggregation rule for the base learners is the average of their soft outputs. The validation and training errors of the updated ensemble can then be computed, thereby reaching a new meta-state. The meta-sampler is regarded as a policy in the RL framework, i.e. a mapping from states to actions. It is trained using the soft actor-critic deep RL algorithm, which is suitable for continuous state and action spaces. The main contributions are: (i) a new meta-ML framework for imbalanced data that is not tied to a specific type of base learner, (ii) a meta-learner that is task-agnostic and can be re-used for different ensembles of base learners addressing different tasks. The method compares favorably to prior art.
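To illustrate the meta-state and the linear aggregation rule described in this summary (a sketch under assumed bin counts and function names, not the authors' code):

```python
import numpy as np

def meta_state(train_errors, val_errors, bins=10):
    """Meta-state: concatenated empirical histograms of the training
    and validation errors, each normalized to sum to one."""
    h_tr, _ = np.histogram(train_errors, bins=bins, range=(0.0, 1.0))
    h_va, _ = np.histogram(val_errors, bins=bins, range=(0.0, 1.0))
    return np.concatenate([h_tr / max(h_tr.sum(), 1),
                           h_va / max(h_va.sum(), 1)])

def ensemble_predict(base_probas):
    """Linear aggregation rule: the average of the base learners' soft outputs."""
    return np.mean(base_probas, axis=0)
```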

Strengths: The main strength of the proposed method is its flexibility and re-usability (see reported contributions above). In my opinion, an interesting aspect w.r.t. novelty is the use of RL techniques to train the meta-learner. The paper addresses an open problem in ML and is thus perfectly suited for a submission at NeurIPS.

Weaknesses: The main weakness of the proposed framework is that a few aspects of the authors' construct are somewhat arbitrary. In particular, the choice of a Gaussian distribution to compute weights is not clearly justified and does not seem very natural. Moreover, the impact of this choice is not discussed. Can't the RL agent directly issue a set of weights? Likewise, the use of a linear aggregation rule is not much discussed. Wouldn't an RNN, i.e. a trainable non-linear aggregation that can forget a bad base learner (such as the initial one), be better?

Correctness: Experiments in 4.1 should be extended to several imbalance ratios to demonstrate the ability of the proposed method to handle different imbalance schemes. This is important, as the validity of the method relies solely on experimental evidence. Also, unless I'm mistaken, the total number of training examples does not appear in these synthetic experiments. Table 2: I am unclear about the number k of base learners used in these experiments. Error bars are only given in Figure 4 and Table 4; such information must be given for all the provided results (even those with a so-called comfortable margin). The origin of the variance of the results should be highlighted. The impact of the train/valid/test split is not commented on. The experiments should be embedded in a stratified CV loop. Finally, a comparison to other meta-ML methods would also be welcome, even if they are tied to the DNN framework.

Clarity: The paper reads pretty well in my opinion.

Relation to Prior Work: The increment over prior work is clearly stated, and a short review of the state of the art is provided.

Reproducibility: Yes

Additional Feedback: Minor comments: The error e does not appear in (3). Another notation should be used to avoid confusion with Euler's number in the Gaussian. In Alg. 1, it is clear that the input of the Gaussian distribution is the error (this is not so clear in the text). ---- Post author response: The authors have provided a number of additional experimental details, which give stronger guarantees on the quality and relevance of their results. Consequently, I have updated my score from 6 to 7, although I am still a bit unsatisfied with the justification of the Gaussian function used to compute the weights.