Reviews: Combinatorial Bayesian Optimization using the Graph Cartesian Product

Beautiful approach to deal with categorical and integer variables in Bayesian Optimization that builds a graph with the combinations of these variables. Necessary approach to continue the research in making BO able to deal these variables. When I read papers such as this one I am willing for it to be accepted. I do not have much to say rather than good job. And I say this being an author that have proposed another approach in the same scenario. If we add the quality of the paper with the sensational supplementary material, the theoretical material added and the quality of the experiments, I heavily recommend for acceptance this paper. It is a paper that I will give a seminar to my colleagues without a doubt and maybe, who knows, expand it. Congratulations to the authors. The code would have been nice to be uploaded. I just wish to say to the authors that I would like to see an extension of this paper published in a journal, it would be very great.

This manuscript proposes a system for combinatorial Bayesian optimization called COMBO, aimed at problems with large numbers of categorical and/or ordinal features. The main contribution is an effective kernel for this setting based on applying a graph kernel to the graph Cartesian product of each of the features, which can be computed efficiently by exploiting structure. This kernel can be further enhanced using an ARD extension and a horseshoe prior to encourage sparse feature selection. The COMBO system then creates a GP with this kernel and does random + local search to maximize an acquisition function such as EI in the combinatorial space. A series of experiments demonstrate COMBO performing better on real and synthetic tasks than alternatives such as systems using one-hot encodings. Overall this is a well-written paper and was interesting to read. I think the proposed kernel may find use in Bayesian optimization or elsewhere, and would be a nice addition to a GP/BayesOpt software packages. There would certainly be interest in this contribution by attendees of the conference. That said I do have some concerns, mostly minor, that I will enumerate below in mostly linear (rather than importance) order: - I wish the manuscript could be more self-contained. There are a so many forward references to the supplementary material that it can be somewhat frustrating for the reader. I understand the page limits are sometimes oppressive, but for some of these I wish they were at least complemented by some intuition or summary to keep the flow. For example the use of a horseshoe prior is mentioned several times as a main contribution and responsible for a large increase in performance, but this is not really justified in the paper except a pointer to the supplement in line 277. It would be nice to say something about this result beyond "The Horseshoe prior helps COMBO..." Added after response: I wish you had addressed the above comment in your response. I continue to feel the manuscript as written is not as effective as it could be. - I wanted to point to a couple more papers using Bayesian optimization in (different) combinatorial spaces: - Garnett, et al. Bayesian Optimization for Sensor Set Selection. IPSN 2010. [optimization over subsets of a ground set] - Malkomes, et al. Bayesian optimization for automated model selection. NeurIPS 2016. [optimization over a graph defining GP kernels] - I personally think stating and setting off Theorem 2.2.1 as a theorem is overkill, as the result is in my opinion both intuitive and straightforward. It needlessly breaks the flow of the text. - I feel the same way about Proposition 2.3.1. I think the paper would be nicer to read if this were simply stated as a normal sentence. (Again I find this result unsurprising, albeit of course convenient!) - [*] Perhaps the biggest technical concern I had was about the optimization of the acquisition function in COMBO, which is of course also a combinatorial optimization problem. The authors simply do random search followed by local refinement. Although this will help identify good "exploration moves," it is not clear to me that this will help with "exploitation moves," when they are warranted, as it's not clear that you'd be able to stumble onto the part of the domain near the best-seen points using this method, where the optimum will lie in the latter case. Note that the two papers mentioned above deal with this problem by combining an exploration heuristic (effectively the same, random selection) with an exploitation heuristic (measuring the acquisition function at points near the best-seen observations). The fact that was not done here makes me wonder whether the improved performance of COMBO may simply be a result of biasing EI toward more exploration than it would normally do. The current experiments give no insight into this possibility. Added after response: thank you for your clarification. - Are the standard errors in Table 2 correct? The standard deviation for COMBO was only 0.006? Did it find the same point every time but once or something? Added after response: I still find it extremely suspicious that the standard error would be so absurdly small with no real explanation given. Yes there is no observation noise, but there are other sources of randomness that seem like they should contribute non-negligable variance. [getting very minor now] - Table 3 would be nicer to read if the -'s were typeset as negative signs ($-20.05$) rather than hyphens (-20.05) - The use of dashes in the manuscript --en dashes not set off by spaces on the side of the parenthetical comment-- is so idiosyncratic and nonstandard to the point of being jarring. - Please read over the references, there are some typos (e.g., gaussian).

Paper ID:	1678
Title:	Combinatorial Bayesian Optimization using the Graph Cartesian Product

Reviewer 1

Reviewer 2

Reviewer 3