NeurIPS 2020

Training Stronger Baselines for Learning to Optimize


Review 1

Summary and Contributions: This paper proposes changes to L2O to make a generic L2O algorithm easier and faster to train using common techniques like curriculum learning and imitation learning. The authors show their method gives significant improvements across many existing evaluation criteria and methods. EDIT after rebuttal: Thank you very much for addressing our concerns. I really appreciate your running the additional experiments we requested and will increase my score. As an extra suggestion, please include the validation/test loss plots in your paper as opposed to the training loss; it is hard to evaluate whether the method is truly outperforming or just overfitting to a small training set. Also, for figures c/d in the rebuttal, please include baseline optimizers (SGD/Adam) too. I'm not concerned with whether you beat those, but it would be helpful to include them as a reference.

Strengths: The work is technically sound; although the techniques used are not necessarily novel, the authors did a good job of evaluating a large combination of them, which is non-trivial. This paper focuses specifically on improving a generic L2O algorithm rather than arguing that their method is the best. I see this as a significant contribution that is very relevant to the NeurIPS community. The empirical results indicate their method is a significant improvement over the SOTA of L2O, which is extremely impactful to the NeurIPS community in the long run.

Weaknesses: Although the authors did a great job comparing against other L2O methods, I felt there could be more comparison against hand-designed optimizers like Adam/RMSProp/etc. It is hard to gauge how far we are from replacing the hard-coded optimizers with learned ones without a side-by-side comparison. I would also be interested in seeing performance across significantly different architectures (e.g., ones available in NAS search spaces) rather than small changes in architecture like the number of layers/stride/size. The MLP -> Conv comparison is helpful, but limited.

Correctness: As far as I could tell, the experiments and claims are correct.

Clarity: The paper was well written overall, although the explanation of Figure 4 could be clearer.

Relation to Prior Work: I think the paper did a good job discussing related work.

Reproducibility: Yes

Additional Feedback: I don't fully understand the paragraph at line 108 on how this is related to explore-exploit in RL; I think we are not gaining much from this analogy. Minor typos/suggestions:
- Algorithm 1: switch from the while loop to a do-while to make sense of the "stop" flag, or use "while True" with continue/break statements (a minimal sketch of this restructuring follows below).
- L106: "optimzees".
- L219: "optmizee problemsduring".
- L272: "100, 200, 500, 1, 000"; the comma in 1,000 is confusing.
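For illustration, a minimal sketch of the suggested restructuring, assuming a generic training loop; `train_step` and `should_stop` are hypothetical placeholders, not functions from the paper:

```python
# Hypothetical sketch of the suggested restructuring: a "while True" loop with an
# explicit break, so the body always runs at least once and the "stop" flag becomes
# an ordinary exit condition. train_step() and should_stop() are placeholders.

def train_step(state):
    state["steps"] += 1          # stand-in for one L2O training unroll
    return state

def should_stop(state, max_steps=10):
    return state["steps"] >= max_steps   # stand-in for the paper's stop flag

state = {"steps": 0}
while True:
    state = train_step(state)    # body executes before the stop check (do-while style)
    if should_stop(state):
        break
```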


Review 2

Summary and Contributions: This paper presents simple techniques to improve the performance of L2O-DM, which is a standard baseline for learning-to-optimize problems. One is curriculum learning that controls the number of training unrolling steps, and the other is imitation learning that incorporates analytical optimizers. Although there are no theoretical foundations, empirical results support the usefulness of these techniques for strengthening the baseline methods. Besides, the techniques can be incorporated into the SOTA L2O models as well.

Strengths: (1) The proposed techniques are simple yet reasonable. (2) Empirical results are extensive and well support the benefit of imitation learning and curriculum learning in the L2O problem. (3) The paper is well written and easy to follow. (4) The topic is relevant to the NeurIPS community.

Weaknesses: I don't have any major comments. Below are minor comments and questions. (1) Figure 5 suggests that CL does not generalize as well as IL to a change of model architecture (MLP to CNN). Can you explain a possible reason? (2) The definition of L_f(\phi) given at line 98 is somewhat odd to me, since the right-hand side of the expression does not depend on \phi.
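For reference, in the standard L2O-DM formulation (Andrychowicz et al., 2016) the dependence on \phi is implicit, entering through the iterates \theta_t produced by the learned update rule; the convention is sketched below (this may or may not match the paper's line 98 exactly):

```latex
% Standard L2O-DM objective; the \phi-dependence is implicit through the iterates
% \theta_t generated by the learned update rule m(\cdot;\phi).
\mathcal{L}_f(\phi) = \mathbb{E}_f\!\left[\sum_{t=1}^{T} w_t\, f(\theta_t)\right],
\qquad \theta_{t+1} = \theta_t + g_t,
\qquad (g_t, h_{t+1}) = m\!\left(\nabla_\theta f(\theta_t),\, h_t;\ \phi\right).
```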

Correctness: The empirical methodology is correct. Claims are validated by the empirical results.

Clarity: This paper is well written and easy to follow.

Relation to Prior Work: This paper discusses its differences from prior work. As I'm not an expert on the L2O literature, I might have missed some important work in this field, especially regarding the technical proposal.

Reproducibility: Yes

Additional Feedback: == After Rebuttal == Thanks to the authors for the rebuttal. I didn't find any additional concerns during the rebuttal. I'd like to keep my score.


Review 3

Summary and Contributions: This paper proposes a progressive training strategy for L2O models, aiming to improve performance by mitigating the dilemma of truncation bias vs. gradient explosion. Moreover, an off-policy imitation learning method is introduced to prevent the L2O model from overfitting. Several experiments were conducted to demonstrate the effectiveness of the proposed method.
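As a rough illustration of the progressive-training idea described above, here is a minimal sketch of a curriculum over truncated unroll lengths; the doubling schedule and helper names are assumptions for illustration, not the paper's exact procedure:

```python
# Hypothetical sketch of a curriculum over unroll lengths: start with short truncated
# unrolls (less risk of exploding gradients through the unrolled computation graph)
# and lengthen them as training proceeds (less truncation bias). Illustrative only.

def unroll_curriculum(start_len=5, max_len=100):
    length = start_len
    while length <= max_len:
        yield length
        length *= 2              # doubling schedule is an assumption, not the paper's

for unroll_len in unroll_curriculum():
    # placeholder for training the L2O optimizer with this truncation length
    print(f"train with truncated unrolls of length {unroll_len}")
```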

Strengths: 1. This paper provides a new perspective on the L2O model, which is novel and interesting. 2. The motivation and solutions are straightforward and reasonable. 3. The effectiveness of the proposed approach has been sufficiently demonstrated and discussed in detail, with adequate experimental results and corresponding analysis. 4. The code is available for better reproducibility.

Weaknesses: One criticism is that most techniques are "adaptations" from the RL literature. But that is just nitpicking, as the techniques are new to L2O and the authors explained the specific motivation for applying them in L2O very well (Section 3). The imitation learning also has a natural context in L2O, i.e., mimicking analytical optimization algorithms that are guaranteed to be good (Section 4). Another suggestion: this paper is still (understandably) benchmarked on a few simple optimizees that were typically used by previous L2O literature (line 215, i - v). It remains questionable whether L2O, combined with the proposed new training techniques, can eventually scale up to larger models, even just LeNet (2 conv + 3 fc) or so. It would be good to include some results or discussion here.
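For concreteness, a generic sketch of what imitating an analytical optimizer could look like in this setting: penalize the learned update's deviation from a one-step Adam update at the same point. This is only an illustration of the idea referred to above, not the paper's actual loss; all names are placeholders.

```python
# Generic sketch of an imitation term for L2O: L2 distance between the learned
# update and the update a hand-designed optimizer (Adam here) would take at the
# same point. Illustrative only; not the paper's actual formulation.
import torch

def adam_imitation_penalty(learned_update, grad, m, v, t,
                           lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    t = t + 1
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    adam_update = -lr * m_hat / (v_hat.sqrt() + eps)
    penalty = ((learned_update - adam_update) ** 2).mean()
    return penalty, m, v, t
```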

Correctness: Yes, both the claims and the empirical methodology are correct.

Clarity: This paper is well written and easy to read.

Relation to Prior Work: Yes. This paper focuses on a totally complementary direction to previous L2O literature.

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions: This paper proposes a curriculum learning and imitation learning approach to improve training for the learning-to-optimize problem.

Strengths: 1. The proposed method is intuitive and practical, and the results seem promising. 2. The paper tests the method on a range of tasks.

Weaknesses: 1. The paper specifies its difference from self-improving in Section 4.2, but it's not clear why the proposed imitation learning is a better strategy. 2. Two strategies are proposed, but it is not clear from the experiments how much of the improvement comes from curriculum learning and how much comes from imitation learning.

Correctness: Yes, it seems correct to me.

Clarity: The paper is relatively clear, but there are some sections that are not well explained. It is not clear from the method section why certain design choices were made.

Relation to Prior Work: It is discussed, but I'm not an expert in the area, so I don't know if any work is missing.

Reproducibility: No

Additional Feedback: