Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
Originality: The idea is interesting and, to the best of my knowledge, has not been tested before. However, applying gradient descent to update parameters is not very original.

Clarity: Overall, the paper is well written. The structure could be improved, however; for example, placing the related work section right before the conclusion is quite disruptive.

Quality: The description of the method is sound and technically correct. The experimental section is well designed and fairly assesses the method; I particularly appreciate the experiments with noisy labels. However, I have some concerns. Can the authors comment on the importance of the instance weights? Adding a trainable weight per sample can substantially increase the size and capacity of the model, which may make the comparison to the baselines unfair. Moreover, a missing analysis in the paper is to track the learned parameters and check whether they actually do what they are expected to, i.e., prioritize easy samples at the beginning of training and progressively evolve toward uniform importance.

Significance: The experimental results are extensive and show significant improvement over state-of-the-art models in several settings and on several datasets. This demonstrates the interest of the method, but again, it is not clear to me how much of the performance gain is due to the curriculum learning mechanism.
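The missing analysis suggested above could be run as a simple post-hoc diagnostic: log the learned per-instance weights at each epoch, then check (a) whether early-epoch weights correlate with an "easiness" proxy, and (b) whether the weight distribution flattens toward uniform over time. A minimal sketch, assuming the weights are available as an (epochs, N) array (the function name and the choice of initial loss as the easiness proxy are my own, not the paper's):

```python
import numpy as np

def curriculum_trace(weight_history, early_loss):
    """Diagnose whether learned instance weights behave like a curriculum.

    weight_history: (epochs, N) array of per-instance weights logged each epoch.
    early_loss: (N,) per-sample loss at the start of training (easiness proxy:
                low initial loss = easy sample).
    Returns (corr, spread): per-epoch correlation between weight and easiness,
    and per-epoch standard deviation of the weights (0 = uniform importance).
    """
    corr = np.array([
        # positive correlation means easy (low-loss) samples get high weight
        np.corrcoef(w, -early_loss)[0, 1]
        for w in weight_history
    ])
    # spread of weights within an epoch; should shrink toward 0 if the
    # weights evolve toward uniform importance
    spread = weight_history.std(axis=1)
    return corr, spread
```

If the mechanism works as intended, the correlation should start high and the spread should decay over epochs; a flat correlation near zero would support the concern that the gains come from extra capacity rather than from curriculum ordering.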
Originality: The intuition of the model is similar to existing curriculum-based models (e.g., MentorNet), but the approach to estimating the weight of each example is new. In this paper, the authors propose to learn the weight of each instance/class via gradient descent. The related work section is clear and explains the differences from existing approaches.

Quality: The proposed dynamic curriculum learning framework is validated on image classification and object detection tasks. In particular, the proposed curriculum-based approach outperforms existing curriculum-based approaches such as MentorNet. The model also remains accurate when learning with noisy labels, because it learns to ignore noisy training examples.

Clarity: The submission is clearly written and well organised.

Significance: Many approaches have been proposed recently to speed up the training of deep models, and I think curriculum-based models are an interesting direction for speeding up convergence. I also think that designing models robust to noisy labels is an important problem, because a lot of real data contains noisy labels.

---------------------------------------------
The rebuttal addresses my concerns and I think this is a good paper.
The paper presents a novel approach to curriculum learning by introducing two new sets of parameters to learn, one per class and one per example, which correspond to temperatures in the softmax classification layer and can easily be trained by gradient descent.

Considering the class-based version (equation 1), I wonder why the model without that extra set of parameters cannot learn the same thing through the existing weights (it is just a scaling of the logit, after all). Instance-level temperatures make more sense, but they can only work on small enough datasets where each example is expected to be seen many times; on a very large dataset, it is unlikely that each example would be seen often enough for the model to learn a relevant temperature. Making the temperature a function of the example (and maybe the label) could alleviate this, but might this then become similar to other competing approaches?

I would suggest that the numbers given in the text on page 5, around line 170, be put in the relevant Table 1 for ease of comparison. I did not understand the difference suggested on page 8, lines 285-286.