Summary and Contributions: This paper aims to increase the efficiency of BO on DL/DRL, which are both expensive and challenging to tune. The main contributions of this paper are a comparison across different DRL tuning tasks and the use, and learning, of a training-curve compression.
Strengths: I find the paper interesting and practical. The experimental section is convincing and well executed, and the steps of the method are clearly explained and seem different from previous work. Fig. 5 does a good job of convincing the reader that the final parameters are more stable.
Weaknesses: Ill-conditioning: Section 3.2 seems to provide a non-intuitive solution to the underlying problem. Selecting a subset of the points through active learning seems unnecessary, and it would be better to understand what causes the conditioning problems. Have you experimented with different kernels, and how does this affect the issue? Baselines: I would still like to see a comparison to FABOLAS, Freeze-Thaw, MISO-KG, and BOHB to further strengthen the experimental evaluation.
Clarity: The paper is well-written and easy to read.
Relation to Prior Work: Yes
Additional Feedback: === Post-rebuttal update === Thank you for your response. The ill-conditioning likely comes from points being close together, but the decision to use one of the smoothest kernels (SE) available is questionable in this case. It would be interesting to explore a less smooth kernel in combination with removing points that are close to each other. While I still think that the active learning approach is an unnecessarily complicated solution to the problem, it seems to work. I agree with the other reviewers that this paper is borderline and offers interesting empirical results while lacking novelty and theory.
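For context, the conditioning gap between the SE kernel and a rougher alternative such as Matérn-1/2 is easy to demonstrate numerically. This is a purely illustrative sketch with made-up inputs and lengthscales, not the paper's setup:

```python
import numpy as np

def se_kernel(X, ls):
    """Squared-exponential (RBF) Gram matrix for 1-D inputs X."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

def matern12_kernel(X, ls):
    """Matern-1/2 (exponential) kernel: far less smooth than SE."""
    d = np.abs(X[:, None] - X[None, :])
    return np.exp(-d / ls)

rng = np.random.default_rng(0)
X_spread = np.linspace(0.0, 1.0, 20)            # well-separated inputs
X_close = 0.5 + 1e-4 * rng.standard_normal(20)  # near-duplicate inputs

for name, X in [("spread", X_spread), ("clustered", X_close)]:
    for kname, kern in [("SE", se_kernel), ("Matern-1/2", matern12_kernel)]:
        print(f"{name:9s} {kname:10s} cond = {np.linalg.cond(kern(X, 0.1)):.2e}")
```

With near-duplicate inputs, the SE Gram matrix is almost rank-one and its condition number explodes, while the rougher Matérn-1/2 kernel degrades by several orders of magnitude less; this is exactly why removing close points (or switching kernels) restores stability.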
Summary and Contributions: This paper proposes a BO method to efficiently learn the hyperparameters of deep RL systems. Their BO algorithm uses a new objective function that together with a sample selection helps improve the learning process.
Strengths:
- The paper is well-written and well-structured; the abstract and introduction read very well.
- I like the idea; it is novel and has been well investigated through experiments.
- The paper does a great job of referring to prior work and comparing against it.
Weaknesses: I am not convinced by the experiments that the method is significantly better than other learning methods. For example, in Figure 6, the performance of BOIL is only marginally better than that of other methods. The authors mention that BOIL is much faster, but the evaluation time also seems to offer only a marginal improvement.
Clarity: yes, very well-written all throughout
Relation to Prior Work: Yes, the authors do a thorough and comprehensive job of comparing to other methods.
Additional Feedback: Thank you for the response. That clears things up.
Summary and Contributions: This paper presents a method for improving deep reinforcement learning hyperparameter tuning by: building a surrogate model that incorporates the iteration number (similarly to a fidelity variable), predicting the query cost, compressing the reward curve into a single value, and augmenting the observations with a curve fit.
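A minimal sketch of this kind of joint surrogate, with a product SE kernel over (hyperparameter, iteration) pairs; toy data and lengthscales are assumed here and are not the authors' implementation:

```python
import numpy as np

def se(a, b, ls):
    """Squared-exponential kernel between 1-D arrays a, b."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def k_joint(A, B):
    """Product kernel over (hyperparameter x, iteration t) pairs."""
    return se(A[:, 0], B[:, 0], 0.3) * se(A[:, 1], B[:, 1], 50.0)

# Observations: (x, t) -> compressed score after t iterations.
Z = np.array([[0.1, 20.0], [0.1, 100.0], [0.8, 20.0], [0.8, 100.0]])
y = np.array([0.2, 0.5, 0.3, 0.9])

# GP posterior mean, with a small jitter for numerical stability.
K = k_joint(Z, Z) + 1e-6 * np.eye(len(Z))
alpha = np.linalg.solve(K, y)

def posterior_mean(x, t):
    q = np.array([[x, t]])
    return float(k_joint(q, Z) @ alpha)

print(posterior_mean(0.8, 150.0))  # score extrapolated to a larger budget
```

Treating the iteration number as an extra input dimension is what lets the model share information across training budgets, much like a fidelity variable in multi-fidelity BO.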
Strengths: The experimental evaluation is sound. The experiments are interesting enough while remaining tractable and reproducible. There is an ablation study. The state of the art is properly compared against.
Weaknesses: Despite the authors' conclusions, this approach seems limited largely to DRL problems. For example, the authors claim that they cannot use Freeze-Thaw or FABOLAS because those assume exponential decay, but this work assumes a logistic function instead. Therefore, it might be applicable only when the logistic approximation is a good compression.
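As a purely illustrative sketch of the logistic compression under discussion (the synthetic data and parameter names are mine, not the authors'), one can fit a three-parameter logistic to a noisy reward curve and take its asymptote as the scalar score:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Three-parameter logistic: L / (1 + exp(-k (t - t0)))."""
    return L / (1.0 + np.exp(-k * (t - t0)))

# Synthetic noisy "reward curve" whose shape is roughly logistic.
rng = np.random.default_rng(1)
t = np.arange(200, dtype=float)
rewards = logistic(t, L=100.0, k=0.08, t0=80.0) + 5.0 * rng.standard_normal(t.size)

# Fit the curve; the asymptote L serves as the compressed scalar score.
(L_hat, k_hat, t0_hat), _ = curve_fit(
    logistic, t, rewards, p0=[rewards.max(), 0.1, t.mean()]
)
print(f"compressed score (asymptote) ~ {L_hat:.1f}")
```

The point of the weakness above is visible here: if the true training curve is not roughly sigmoidal (e.g., it decays or oscillates), the fitted asymptote is a poor summary, so the method inherits a shape assumption just as Freeze-Thaw inherits exponential decay.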
Correctness: This is mostly a practical paper and the empirical methodology is one of the strong points.
Clarity: The paper is hard to follow at some points. For example, Sections 3.2 and 3.3 should be swapped, as Section 3.2 makes continual reference to Eq. 7, and even in the algorithm the order is the opposite.
Relation to Prior Work: The related work section is one of the clearest parts of the text. All the related work is properly cited, discussed, and compared where needed. Where a comparison is not possible, this is justified.
Additional Feedback: Update: After reading the authors' response and the reviews, I have updated my score. I think the authors should add Fig 1 of their response to the supplementary material, as it makes the design and purpose of the method quite clear and intuitive. However, the authors' response has increased my concerns about the method being truly useful for DRL applications. Although this is not a drawback "per se", I think the manuscript would be much stronger if it embraced that "limitation" and presented the method as an integrated DRL strategy.
-----------------------------------
- The comment on ill-conditioning when points get close: this has been well known in the BO community since the early 2000s, and it is one of the reasons to avoid the squared-exponential kernel. See for example: M.J. Sasena. Flexibility and Efficiency Enhancement for Constrained Global Design Optimization with Kriging Approximations. PhD thesis, University of Michigan, 2002.
- How is the reward curve handled in the CNN case? Was it also compressed using a logistic curve? Given that the initial noise in that case should be smaller than in the DRL cases, it is surprising that BO performs worse there while it does fine in the DRL cases.
- The authors cite both  and  as Hyperband, but I assume they only use the random case  in the comparison. Given the competitiveness of Hyperband in all cases (and when it fails, BO does excellently), it seems that the combination of both as in  would be a tough competitor.
Summary and Contributions: The paper presents a practical approach that tries to circumvent the computational cost of tuning the hyperparameters of learning algorithms with an iterative structure, by transforming each training curve into numeric scores representing training success as well as stability, and then selectively augmenting the data using information from the curve. They provide a good experimental framework with enough empirical evidence to support their claims, with examples on deep reinforcement learning and convolutional neural networks. The paper resembles Freeze-Thaw Bayesian optimization, but I prefer this method. I think the paper can be presented at the NeurIPS conference, provided that my observations below are addressed.
Strengths:
- Significance
- Empirical work
- Reproducibility (if code is given)
Weaknesses:
- Theoretical work
- Novelty (Freeze-Thaw Bayes. Opt.)
Correctness: Yes, I think so.
Clarity: Yes, it is.
Relation to Prior Work: Yes
Additional Feedback: Author rebuttal: I have read the authors' response and talked with the other reviewers about this paper. After that, I keep my score, as my most important demands were not met, and the reviewers conclude by giving this paper an average of 6.
=========
-> I would have liked a bit of theoretical work in the paper; aren't there any simple bounds that you can establish with respect to vanilla BO?
-> I also miss a computational complexity analysis of the BOIL algorithm.
-> I expected the authors to provide the source code during the review process, as stated in the paper.
-> The reinforcement learning literature in the introduction is too broad and too informal. I would formalize it a bit for BO researchers who do not have experience in reinforcement learning, and make it less "narrative". The algorithm would then make more sense. You could, for example, give a formal, abstract representation of the iterative learning algorithms for which BOIL is effective, with DRL as a concrete example. It would add quality to the paper.
-> I was familiar with Freeze-Thaw Bayes. Opt. but had not considered that it was not optimal for this case. Could you provide a simple, perhaps synthetic or toy, experiment where this can be shown? It would give the paper a lot of points, since Freeze-Thaw is so popular, and would augment its credibility.