Sun Dec 8 through Sat Dec 14, 2019, at the Vancouver Convention Center
* The authors seem unaware of prior work on learning which hyperparameters are most important to tune and which values are most promising. For instance, https://arxiv.org/pdf/1710.04725.pdf learns priors over the hyperparameter space.
* The method only works for completely numeric spaces. The numeric and non-numeric subspaces could be tuned separately, but then interactions between the two kinds of features cannot be learned, whereas a full Bayesian optimization approach could capture them.
* The results show excellent anytime performance: with limited resources, the method provides good solutions fast. However, since the search space is restricted, the very best configurations may fall outside it, and a more extensive search might yield a better model. This is not demonstrated, since all experiments were cut off after 50 iterations.
* The ellipsoid introduces an extra hyper-hyperparameter. How should it be tuned?
* Generalizability is questionable. The toy problem is very small, and I don't really see what to take away from those results. The OpenML experiments are larger (30 datasets, up to 10 hyperparameters), but this is still rather small compared to other work in the literature, e.g., auto-sklearn with hundreds of hyperparameters and datasets. All experiments involved very densely sampled problems; it is unclear how accurate this method is when only sparse prior metadata is available. One troubling sign is that random search easily outperforms more sample-efficient approaches in these experiments.
* Why were only 4 meta-features used? Most prior work uses many more.
* It is unclear why the hyperparameters were discretized in the neural network experiments.
Update after the author feedback: Most of my concerns have been addressed. I still have some reservations about generalizability, but the current experiments are promising.
[I have read the author rebuttal and upgraded my score. My concern about generalization still remains, and I hope the authors can devote a sentence or two to it in the final draft - even something to the effect of "it is a concern; experimental evidence suggests it is not a great one."]
Summary: For any given ML algorithm, e.g., random forests, the paper proposes a transfer-learning approach for hyperparameter selection (limited to those parameters that can be ordered) wherein a bounding space is constructed from previous evaluations of that algorithm on other datasets. Two types of bounding spaces are described. The box space is the tightest bounding box covering the best known hyperparameter settings from previous datasets. The ellipsoid is the smallest-volume ellipsoid covering those settings, found via convex optimization. The box space can be used by both model-free Bayesian optimization (BO), e.g., random search or Hyperband, and model-based BO, e.g., GPs, whereas the ellipsoid space can only be used by model-free BO.
I liked the idea of transforming the problem of how to leverage historical performance for hyperparameter selection into one of learning an appropriate search space. The paper demonstrates that a relatively simple dataset-independent procedure provides performance gains over not transferring. This is also my main concern with the approach: what if a dataset is itself an outlier relative to previous datasets? In that case, why would we expect "successful" hyperparameter ranges on previous datasets to bound the right setting for the new dataset? Using the box space could be disastrous in this setting (while it could be argued that transferring does not help in this case, it could also be argued that this would not be known a priori). This leads me to believe the approach is not robust in its current form. The ellipsoid space may be more resilient in this regard, but it is limited to model-free methods.
It would be better if the search-space learning could incorporate dataset information, so that performance degrades gracefully when transferring to a unique dataset for which historical performance fails to inform. I do think the finding that certain hyperparameter ranges for common ML algorithms should be tried first in hyperparameter selection is a good empirical finding, but it is not significant enough on its own.
Originality: This reviewer is not aware of a similar proposal, but is not familiar with all the literature that falls into this category.
Quality: The experimental section (and supplement) considered a large range of settings for the problem.
Clarity: The authors have done a great job with respect to organization, exposition, and results descriptions. There was only one place where I felt confused in an otherwise very readable paper: does the experimental parameter n in lines 256-257 correspond to n_t in line 83? If not, please provide more detail on what it represents in the experiment. If so, how were the previous evaluations selected, and what is T for this experiment?
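To make the two constructions in the summary above concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code): the box space is just the per-dimension min/max over the best configurations from previous tasks; the ellipsoid is the minimum-volume enclosing ellipsoid, computed here with Khachiyan's iterative algorithm as one standard way to solve the convex program; and ellipsoid samples are drawn by rejection sampling from the ellipsoid's own bounding box.

```python
import numpy as np

def bounding_box(X):
    # Tightest axis-aligned box covering the rows of X
    # (each row = best hyperparameter setting on one previous dataset).
    return X.min(axis=0), X.max(axis=0)

def min_volume_ellipsoid(X, tol=1e-4):
    # Khachiyan's algorithm for the minimum-volume enclosing ellipsoid
    # {x : (x - c)^T A (x - c) <= 1} of the rows of X.
    n, d = X.shape
    Q = np.vstack([X.T, np.ones(n)])   # lift points to d+1 dimensions
    u = np.full(n, 1.0 / n)            # weights over the points
    err = tol + 1.0
    while err > tol:
        V = Q @ np.diag(u) @ Q.T
        # Mahalanobis-like distance of each lifted point.
        M = np.einsum('ij,ji->i', Q.T @ np.linalg.inv(V), Q)
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step
        err = np.linalg.norm(new_u - u)
        u = new_u
    c = X.T @ u
    A = np.linalg.inv(X.T @ np.diag(u) @ X - np.outer(c, c)) / d
    return A, c

def sample_ellipsoid(A, c, n_samples, rng):
    # Rejection sampling: draw uniformly from the ellipsoid's own
    # bounding box and keep only draws that land inside the ellipsoid.
    half = np.sqrt(np.diag(np.linalg.inv(A)))  # half-extent per dimension
    out = []
    while len(out) < n_samples:
        x = rng.uniform(c - half, c + half)
        if (x - c) @ A @ (x - c) <= 1.0:
            out.append(x)
    return np.array(out)
```

Note that the acceptance rate of this rejection step shrinks as the dimension grows (the ellipsoid occupies an ever-smaller fraction of its bounding box), which is consistent with the reviews' point that the approach is most practical for the moderate-dimensional spaces tested here.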
Summary of the main ideas: A novel methodology for learning search spaces in Bayesian optimization (BO) that can build ellipsoidal search spaces sampled via rejection sampling. I like this paper because, after reading several papers on this topic, this methodology seems to me the best and most practical way to learn the search space in BO. There is further work to do, in the sense that it cannot be applied as-is to GPs, but I think an approach such as PESMOC, modeling the ellipsoid as a constraint, would solve that easily (this is discussed in the paper). I see this so clearly that I would even be willing to collaborate on that idea. The approach gains quality in that it does not need extra parameters (critical in BO), it proposes several methods, and they are simple yet effective. A brilliant paper from my point of view.
Related to: learning search spaces in BO.
Strengths: Effective and simple methods that solve a popular task in BO more easily than previous methods. Proposes several alternatives. Proposes further work to address the only tasks that remain pending. It seems to open a new set of methodologies to be further improved. The paper is very well written.
Weaknesses: One can argue that it is not a complete piece of work, but it is very rigorous. It lacks theoretical content, but that does not matter, as the empirical content and the proposed methodologies are solid.
Does this submission add value to the NIPS community?: Yes, it does. I miss simple but effective methods for BO, and this paper contains them, along with further work to be done. It is an approach I would use if I were working in a company, since it is easy and pragmatic.
Quality: Is this submission technically sound?: Yes, it is. Framing the estimation of the search space as an optimization problem over different geometric shapes, as the paper does, is sound.
Are claims well supported by theoretical analysis or experimental results?: Experimental results support the claims.
Is this a complete piece of work or work in progress?: It is work in progress: GPs are not covered, and they are the most useful model in BO. But the path is set in this work.
Are the authors careful and honest about evaluating both the strengths and weaknesses of their work?: I think so, in my humble opinion.
Clarity: Is the submission clearly written?: Yes, it is even didactic; a good paper that I would show to my students. Is it well organized?: Yes, it is. Does it adequately inform the reader?: Yes, it does.
Originality: Are the tasks or methods new?: They are tools used in other problems, but their application to BO is sound. Is the work a novel combination of well-known techniques?: Yes, a combination of well-known techniques, very well done. Is it clear how this work differs from previous contributions?: Yes, it is. Is related work adequately cited?: Yes, I think so.
Significance: Are the results important?: I think so; they seem to speed up the optimization. Are others likely to use the ideas or build on them?: Yes, definitely; I would do so myself, and I am thinking of contacting the authors after publication or rejection, as I found this piece of work very interesting. Does the submission address a difficult task in a better way than previous work?: Yes, absolutely. Does it advance the state of the art in a demonstrable way?: Yes, it does. Does it provide unique data, unique conclusions about existing data, or a unique theoretical or experimental approach?: Yes, the experiments provide good results.
Arguments for acceptance: Effective and simple methods that solve a popular task in BO more easily than previous methods. Proposes several alternatives. Proposes further work to address the only tasks that remain pending. It seems to open a new set of methodologies to be further improved. The paper is very well written.
Arguments against acceptance: Honestly, I would accept this paper. I know that NIPS has high standards, but I think it is a good paper.