__ Summary and Contributions__: The authors extended optimization in the latent space of a generative model by weighting acquired samples and periodically retraining the generative model. Results on two synthetic and one real-world benchmark task show that weighting and retraining help to find the optimum of a function faster.

__ Strengths__: The paper is clearly written.

__ Weaknesses__: 1. Weighted retraining is not new. The cross-entropy method (De Boer et al., 2005; Neil et al., 2018) maximizes the expectation E_p(x)[f(x)] of the objective function f(x) when sampling from a policy p(x) by periodically retraining p(x) on the samples with the highest reward, e.g. those with a reward above a quantile cutoff (i.e. using a stepwise weighting function). Instantiations of the cross-entropy method include DbAs (Brooks et al.) and FBGAN (Gupta et al.). Reward weighted regression (RWR) (Hachiya et al.) is another existing optimization technique that employs weighted retraining. Angermueller et al. (http://arxiv.org/abs/2006.03227) recently employed these techniques as baselines for high-dimensional discrete optimization.
2. The described rank-based weighting function is not new. See RankGAN (Lin et al. 2017) or LeakGAN (Guo et al. 2017) for an example.
3. The evaluation is missing important baselines such as DbAs, FBGAN, RWR, and model-based optimization.
4. Chemical design task:
It is unclear how the optimization trajectory of ‘original’ was obtained. How were new data points sampled from JT-VAE? Why does the trajectory stop at 250?
5. In addition to JT-VAE, I would also like to see a comparison with GCPN (You et al.) and reinforcement learning.
6. What do error bars represent? How often were experiments repeated with different random seeds?
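To make the prior-work overlap in point 1 concrete, the cross-entropy method's stepwise (quantile-cutoff) weighted retraining can be sketched in a few lines (a toy illustration with a Gaussian "generative model" and a made-up objective, not any of the cited implementations):

```python
import numpy as np

def cross_entropy_method(f, dim=2, n_samples=200, quantile=0.9, n_iters=100):
    """Maximize f by repeatedly refitting a Gaussian 'policy' p(x) on the
    samples whose reward exceeds a quantile cutoff, i.e. a stepwise (0/1)
    weighting of the acquired data -- the weighted-retraining pattern
    described above."""
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(n_iters):
        x = mu + sigma * np.random.randn(n_samples, dim)
        rewards = f(x)
        elite = x[rewards >= np.quantile(rewards, quantile)]
        # Refit the 'generative model' on the elite samples; the small
        # floor on sigma avoids premature collapse.
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 0.05
    return mu

# Hypothetical toy objective with its maximum at x = (3, -2):
np.random.seed(0)
best = cross_entropy_method(lambda x: -((x - np.array([3.0, -2.0])) ** 2).sum(axis=1))
```

Methods such as DbAs and FBGAN follow exactly this loop, with the Gaussian replaced by a deep generative model.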

__ Correctness__: The methodology is correct.

__ Clarity__: The paper is clearly written.

__ Relation to Prior Work__: The related work section misses a broader discussion of cross-entropy optimization, reward weighted regression, and DbAs. Bayesian optimization has also been used for optimizing high-dimensional functions (ChemBO, http://arxiv.org/abs/2006.03227).

__ Reproducibility__: Yes

__ Additional Feedback__: How do you explain the sudden increase of ‘weight, retrain’ in Figure 4 (right) and the good performance of weighting without retraining in Figure 4 (left)?

__ Summary and Contributions__: The paper proposes a method to improve the general performance and sample efficiency of latent space optimization. Here, the value of an expensive-to-evaluate black-box objective function is to be maximized using as few evaluations of the black-box function as possible. The paper follows previous literature and performs the optimization in a (learnable) latent space where the optimization is presumably easier. The main contributions of the paper are to first identify that the structure of the latent space is normally decoupled from the black-box function, and then to propose two simple solutions to alleviate that problem: 1) reweight the training data according to the objective function and 2) periodically retrain the (DGM) latent space model.
# Author rebuttal: I thank the authors for their rebuttal and would like to acknowledge that I read and considered their answers!

__ Strengths__: Pros:
The paper tackles an important and often encountered problem of optimization of a general black box function.
The authors clearly identify a problem with the previous literature and propose a simple, almost trivial (in a good way), solution to the problem.
The experimental results support that the proposed solution works well: it is both more sample-efficient and achieves better ‘objective function’ results.

__ Weaknesses__: Cons:
In general I found the method section OK; however, some important parts are missing and need to be addressed.
“Fit objective model h” (pseudo algo line 6)
What is h, and how is it fitted? You mention a Gaussian process for the ZINC dataset - why is that model appropriate, and how well does it actually fit the true objective function?
“suggest new latent z̃ based on h” (pseudo algo line 6)
How do you find new latent space samples? Gradient descent on h?
Some of this information can likely be found in the refs or in the appendix; however, this information (in my opinion) really needs to be explained and made self-contained in the main paper.
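For context, the missing steps the points above refer to usually look roughly like the following in the LSO literature (a hedged sketch under my own assumptions -- a plain GP posterior mean for h and random-candidate maximization for the suggestion step -- not necessarily what the authors implemented):

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def fit_gp(Z, y, noise=1e-4):
    """'Fit objective model h': a GP regression over latent points Z with
    observed objective values y; returns the posterior-mean function h."""
    K = rbf_kernel(Z, Z) + noise * np.eye(len(Z))
    alpha = np.linalg.solve(K, y)
    return lambda Zq: rbf_kernel(Zq, Z) @ alpha

def suggest_latent(h, dim, n_candidates=1000, scale=2.0, rng=None):
    """'Suggest new latent z based on h': here by maximizing h over random
    candidates; gradient ascent on h is the other common choice."""
    if rng is None:
        rng = np.random.default_rng(0)
    cand = scale * rng.standard_normal((n_candidates, dim))
    return cand[np.argmax(h(cand))]
```

Whether the paper uses something like this, or a proper acquisition function (e.g. expected improvement), should be stated in the main text.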
It would strengthen the paper a lot if one more real-world example were included in the experimental results (currently two toy tasks and one real-world dataset).
I’m missing some analysis of the performance of the DGM models - e.g., how does the likelihood (ELBO) change due to the weighted training? In particular, it would be interesting to see an experiment showing that the weighting+retraining actually decreases the likelihood of data points with low objective function values.
I found the discussion of the exact weighting implementation unnecessarily confusing (L143-146). The chosen implementation, where the data points are repeated according to the inverse weight, seems to be a complex and approximate solution to simply sampling according to the normalized weights with replacement? (Please confirm or correct me if I’m wrong, and rewrite that section as appropriate.)
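The two constructions contrasted in the last point can be compared directly; both are meant to induce (approximately) the same reweighted data distribution (a sketch under my reading of L143-146; the paper's exact repetition rule may differ):

```python
import numpy as np

def dataset_by_repetition(data, weights):
    """The implementation as I read it: deterministically repeat each data
    point a number of times proportional to its weight."""
    counts = np.maximum(1, np.round(weights / weights.min()).astype(int))
    return np.repeat(data, counts, axis=0)

def dataset_by_resampling(data, weights, n, rng):
    """The suggested alternative: draw n points i.i.d. from the normalized
    weights, with replacement."""
    p = weights / weights.sum()
    return data[rng.choice(len(data), size=n, replace=True, p=p)]
```

If the two are indeed equivalent in expectation, stating so (and why repetition was preferred, e.g. determinism) would resolve the confusion.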

__ Correctness__: Claims seem correct and sound.

__ Clarity__: In general the paper is well written and easy to follow

__ Relation to Prior Work__: Nothing to add

__ Reproducibility__: Yes

__ Additional Feedback__: With respect to the weaknesses, I would like the authors to address the following:
Rewrite parts of the method section with respect to the surrogate objective function h and how the optimization in the latent space is done, to address the comments above - this is of high importance to the quality of the paper
Add an analysis of the quality of the DGM models and how the weighting+retraining shifts the probability density.
Optionally, add another real-world dataset or discuss/address why the chosen datasets are sufficient
If the above are addressed, I would be happy to bump my score

__ Summary and Contributions__: This paper proposes improvements to current latent space optimization (LSO) methods which train a mapping from a lower dimensional latent space to the true input space of the objective function allowing for the more efficient optimization of the latent space. The authors conjecture the way current LSO methods train and use the latent space mapping severely limits their effectiveness. They explain that this is due, at least in part, to training the deep generative model (DGM) on a large proportion of low performing examples as well as the lack of retraining to leverage new samples. They formulate some intuitive arguments for why this might be true by discussing the properties of what they call the "feasible region" representing the region in latent space containing most of the probability mass.
To alleviate these issues, the authors propose to regularly retrain on the weighted dataset. By adjusting the weights of individual data points based on their objective value, they show that they can leverage new data points, i.e., function evaluations, while avoiding catastrophic forgetting. Additionally, they argue that the weighted dataset allows the DGM to capture a large and higher-valued portion of the input space. Some empirical results are provided on a 2D shape area maximization task, an arithmetic expression fitting task, and a chemical design task.
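The proposed procedure, as I understand it, reduces to a short loop (a hedged sketch, not the authors' code; the rank-based weight w ∝ 1/(kN + rank) follows the paper's description, and `train_fn`/`acquire_fn` are hypothetical placeholders):

```python
import numpy as np

def rank_weights(scores, k=1e-3):
    """Rank-based weights: w_i proportional to 1 / (k*N + rank_i), where
    rank 0 is the best (highest-scoring) point."""
    N = len(scores)
    ranks = np.argsort(np.argsort(-scores))  # rank 0 = best score
    w = 1.0 / (k * N + ranks)
    return w / w.sum()

def weighted_retrain_lso(train_fn, acquire_fn, data, scores, rounds=10):
    """Alternate (i) retraining the DGM on the rank-weighted data and
    (ii) acquiring a new point by optimizing in the current latent space."""
    for _ in range(rounds):
        model = train_fn(data, rank_weights(scores))  # weighted retraining
        x_new, y_new = acquire_fn(model)              # latent space optimization
        data, scores = np.append(data, x_new), np.append(scores, y_new)
    return data, scores
```

The weighting keeps all data in play (avoiding catastrophic forgetting) while concentrating the DGM's probability mass on high-scoring regions.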

__ Strengths__: This work seems well motivated and sound. While retraining and importance weighting aren't novel ideas, I consider the main contributions of this work to come from identifying and isolating issues with how LSO is currently used. These types of contributions are relevant to the NeurIPS community and can have considerable impact.
That being said, I am not an expert in this field so I don't consider myself a good judge of the significance of this work or how this work fits into the existing literature.
The authors' conjectures are intuitive and sound. The empirical evaluation supports the authors' conjectures, giving them some credence.

__ Weaknesses__: There are no obvious weaknesses along those axes that I can think of, given my limited knowledge of this field.

__ Correctness__: The claims and method seem correct.
One major (but easily fixed) issue with Figure 4 is the use of the standard error to display bounds on a sample mean computed with 3 samples. There is no reasonable expectation that this sample mean is normally distributed, making the standard error a meaningless measure of uncertainty. For such a low number of seeds, shading with the min/max, or even plotting all three curves, is much more meaningful. Alternatively, the empirical standard deviation could be used, which, unlike the standard error, is not erroneous in this context, but is arguably not very meaningful when summarizing 3 samples.
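To illustrate the point numerically (with made-up scores for three seeds):

```python
import numpy as np

runs = np.array([0.70, 0.85, 0.55])          # e.g. final scores from 3 seeds
mean = runs.mean()
sem = runs.std(ddof=1) / np.sqrt(len(runs))  # standard error of the mean
lo, hi = runs.min(), runs.max()              # min/max band: the actual range
# With n=3 there is no basis for treating `mean` as normally distributed,
# so `mean +/- sem` is not a meaningful uncertainty band; plotting all
# three curves, or shading [lo, hi], is more honest.
```

Here the SEM band (~±0.09) hides that one seed scored 0.55 while another scored 0.85.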

__ Clarity__: The paper is well written and very clear. I enjoyed reading it.

__ Relation to Prior Work__: This paper exhaustively discusses related work. While it would be unlikely for me to catch any omissions, I feel like I have a good understanding about the context surrounding this work.

__ Reproducibility__: Yes

__ Additional Feedback__: - Line 257: given that the training procedure is random, wouldn't there always be randomness in the optimization procedure from the trained mapping? Have the authors examined how much variance is introduced by their retraining approach?
- The meaning of the shaded areas and solid lines in Figure 4 (right) is not mentioned. I assumed it was the same as for the left; if that is the case, the same comments apply.

__ Summary and Contributions__: The paper proposes a method to increase the sample efficiency of latent space optimization (LSO) over structured domains such as molecules. The two key ideas are to place greater weight on higher-scoring data points (with respect to some property) during training and repeatedly retrain the model with new data points queried during property optimization.
The paper identifies two limitations of the current LSO paradigm, proposes the two-part method above, and tests it on two toy tasks (binary shape area optimization and arithmetic expression fitting) as well as a molecule search task.

__ Strengths__: Good empirical evaluation:
The empirical results appear strong. There is little reason to expect that a general latent space learned by a generative model should necessarily be best-suited for property optimization. The evaluation on three tasks and the associated ablation studies (with and without weighting and retraining) show that both components of the method contribute to the observed performance improvement relative to baselines.
Simple method:
The proposed method is quite simple and should be easy for other researchers to try in their pipelines.
Relevance:
I believe this work is relevant to the NeurIPS community. Optimizing structured/discrete objects via learned latent representations is a standard methodological route now, and this paper attempts to alleviate some of its current limitations.

__ Weaknesses__: Discussion:
The paper could have used a bit more analysis of the results with respect to the intuitive arguments presented in sections 3 and 4. For example, for the molecular search task, a molecule is found with a 22.55 logP score. Where in the latent space would this molecule have been located for the generic baseline (i.e., no weighting or retraining)? Is it actually outside the feasible region? If not, perhaps something else changes about the latent space that makes it more amenable to optimization?
Related work:
This is minor, but the related work section could be improved slightly with respect to Bayesian optimization (see related work section below).

__ Correctness__: The method and evaluation appear to be sound.

__ Clarity__: The paper is well-written.

__ Relation to Prior Work__: Yes overall. I believe the Bayesian optimization (BO) section of the related work could be adjusted to touch on some of the BO work that has occurred in the chemical design space. The authors make the claim that no efficient BO methods exist for this domain, but I'm not sure that's true. Several of the works cited in the paper do indeed incorporate BO for chemical design, and a discussion of their claimed limitations would be worthwhile.

__ Reproducibility__: Yes

__ Additional Feedback__: Can the authors comment on why they think the reweight-only baseline for the arithmetic expression task recovered a substantial portion of the weight+retrain performance? This is not observed for the other two tasks.
I think some analysis of the kind mentioned in the weaknesses section would be interesting to see.
The method assumes one already has access to the ground truth property scores for the training set examples. I'm curious how the authors would go about adapting the method for cases where only a subset of the training set is labeled.
Added after the authors provided their rebuttal:
Thanks to the authors for addressing most of the questions. The authors provided a short analysis of the log prob of the best point as retrainings take place. I think this type of analysis is crucial to support the many intuitive arguments presented in the paper.
I still think the claim that no BO methods can handle the structured domains explored in the paper efficiently is a bit tenuous (e.g., JT-VAE has a BO demonstration). Even if the presented method outperforms others, this claim is too strong.
Overall I believe the paper meets the acceptance threshold. However, I do agree with some of the points raised by R1 regarding further comparisons to specific baselines being helpful for a more rigorous evaluation, so ideally the authors can address that.