__ Summary and Contributions__: The paper proposes an approach for generating discrete-structured objects with target properties. The approach is based on an RL formulation that attempts to circumvent the typical high variance of policy gradients.
An objective function is introduced that leverages an expectation wrt a normalized reward distribution over examples in the training set in order to train a conditional generative model (policy). Sampling from the model directly is not required during training.
Empirical evaluation on two tasks, generation of Python expressions evaluating to a certain target value and generation of molecules matching desired properties, demonstrates favorable performance relative to a model trained with a vanilla maximum-likelihood objective.

__ Strengths__: Relevant:
Targeted generation of structured data is an important and active area of research. The proposed method has the potential to make a decent contribution to this area.
Well-written methods:
The step-by-step description of the method is easy to follow.
Sound approach:
The high variance of policy gradients in RL is a standard obstacle to its deployment. With some slight re-formulating of the objective, the authors propose a setup that does not require any model sampling during training. Instead, samples are only drawn from the training set using pre-computed probabilities.
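To make the last point concrete, here is a minimal sketch of what drawing training examples from pre-computed, reward-based probabilities could look like. It is not the authors' implementation: the reward (negative absolute property error), the `temperature` parameter, and the helper names `normalized_reward_probs` and `sample_batch` are all assumptions made for illustration.

```python
import math
import random


def normalized_reward_probs(train_props, target, temperature=1.0):
    """Turn per-example rewards into a probability distribution.

    Assumed reward: negative absolute error between each training example's
    property value and the target, scaled by a hypothetical temperature.
    """
    scores = [-abs(p - target) / temperature for p in train_props]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_batch(train_x, probs, n, rng):
    """Draw n training examples according to the pre-computed probabilities."""
    return rng.choices(train_x, weights=probs, k=n)


# Toy data: Python expressions paired with the value they evaluate to.
train_x = ["1+1", "2*3", "7-2", "4+4"]
train_props = [2.0, 6.0, 5.0, 8.0]

probs = normalized_reward_probs(train_props, target=5.0, temperature=0.5)
rng = random.Random(0)
batch = sample_batch(train_x, probs, n=3, rng=rng)
```

The probabilities depend only on the training set and the target, so they can be computed once up front; no sample from the model is needed at any point.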

__ Weaknesses__: Motivation could be improved:
In the introduction, it is stated that a maximum likelihood objective is fundamentally a poor choice for property-conditional generation. This should be expanded upon. Why is this necessarily the case? In the beginning of the related work section, some intuition for this is briefly touched on ("there exist many candidates that are different from the ground truth but still acceptable"), but the related work occurs late in the paper.

__ Correctness__: The methodology appears correct.

__ Clarity__: The paper is well-written and easy to follow. As stated in the weaknesses, I believe a bit more justification of the immediate assumption that RL is necessarily a better choice than maximum likelihood would serve the introduction well.

__ Relation to Prior Work__: The related work section appears adequate.

__ Reproducibility__: Yes

__ Additional Feedback__: I believe the couple of suggestions above would improve the presentation.
Added after the authors provided their rebuttal:
Overall, I'm happy with the author response and that they expanded upon the ML vs RL motivation, which was quite weak in the original text. I will keep my score of 7.

__ Summary and Contributions__: This work proposes a reinforcement learning approach to generate sequences of tokens that maximize a desired goal, training on (sequence, value) pairs. The key contribution is avoiding the expensive Monte Carlo steps inherent to policy gradient optimization when maximizing the expected reward. Instead, the authors use a normalized rewards distribution and a path-wise derivative estimator. The normalized rewards distribution is itself approximated by reweighting the training data distribution. In addition, to avoid mode collapse and maximize the diversity of the generated sequences, an entropy regularization term is added.

__ Strengths__: The proposed strategy to bypass Monte Carlo sampling in policy gradient optimization is novel, at least in the molecular design arena (not so sure about other areas). It works to a reasonable degree in the tests shown and slightly outperforms maximum likelihood while avoiding the computational cost of other RL approaches.
The entropy term is effective in bridging with the generative ability of the ML approach. The comparison with baselines is appropriate, but not outstanding, and it includes some real-life molecules from ChEMBL, synthetic exhaustively enumerated small molecules from QM9, and some comparisons with the GuacaMol benchmark set of tasks.
There are strong comparisons with other approaches including data augmentation for Maximum Likelihood and Reward Augmented Maximum Likelihood, both of which underperform even the regular Maximum Likelihood approach.

__ Weaknesses__: The improvements over Maximum Likelihood are very moderate and no comparisons are made with more computationally expensive RL approaches (at least on the small QM9 dataset it would be interesting). It would be very interesting to see the performance tradeoff between the proposed approach and the Monte Carlo estimation of the expectation term in Eq 11.
One of the promising features of generative algorithms for molecules is their supposed ability to capture a complex statistical distribution of plausible molecules that can be made, paid for, stored in a vial, etc. The approximately 100 million molecules that have been made and the couple of billions that can be confidently said to be makeable are samples from that distribution. It is not clear how much of chemical space is in that manifold. There are many graphs that are formally valid according to valence rules (and their rdkit implementation) that could not exist as molecules because they are not stable.
The premise of using generative models for molecular design is sampling natural-looking molecules. Just like generative models for faces, one just needs to look at the samples to judge whether the model has learned a richer chemistry than the hard-coded rules of RDKit. Very few molecules are shown from what the model produces. It would be great to add samples to the SI and to the accompanying codebase.
Of the molecules that are shown in Figure 1, 5 contain primary imine groups, which are extremely rare in actual chemistry. The model has not really learned chemical rules (an analogy would be training on celebrity faces and generating celebrity faces with crooked teeth and a mullet over and over). This is particularly interesting because QM9 is an exhaustive enumeration, that is, it contains all the chemically valid molecules (strictly, it is a superset: it also contains non-chemical molecules). If a molecule with 9 heavy atoms is outside the training data, then it is not a true molecule, just an artifact of the SMILES notation / RDKit valence rules.
There are no examples of molecules coming from the ChEMBL-trained model. Since that is a more natural training dataset (QM9 is a synthetic enumeration), the ability to produce natural-looking samples would be interesting to see.
RAML for molecules is interesting, but of course it runs into the same issue: the edit distance between SMILES strings and the distance in the manifold of chemical molecules are very different. Mutations from a genetic algorithm might be a more relevant way to edit molecules. (Chem. Sci., 2019, 10, 3567-3572)
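A small illustration of the mismatch between string edit distance and chemical similarity, using a plain Levenshtein distance written for this note (the helper `levenshtein` is not taken from the paper, and the paper's RAML baseline may use a different edit model). "CCO" and "OCC" are two SMILES strings for the same molecule (ethanol), while "CCN" is a different molecule (ethylamine).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Same molecule (ethanol), two different SMILES spellings.
same_molecule = levenshtein("CCO", "OCC")       # 2

# Different molecules (ethanol vs. ethylamine), one-character change.
different_molecule = levenshtein("CCO", "CCN")  # 1
```

In this toy case the string distance ranks a different molecule as closer than an alternative spelling of the same molecule, which is the kind of distortion an edit-distance reward on SMILES would inherit.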

__ Correctness__: The number of significant digits in Table 5 is confusing. Is it chosen because of the standard deviation? Since the standard deviations are not reported in this table, a summary with fewer digits in the manuscript might be easier to track, with the full numbers reported in the SI. For most of these chemical properties the decimal digits are meaningless.
“atom counter” in line 158 should be atom count?
“is generate” in line 65

__ Clarity__: Excellent, very good balance between qualitative explanation and derivations.

__ Relation to Prior Work__: Yes

__ Reproducibility__: Yes

__ Additional Feedback__: The author feedback addresses some of the challenges raised and applies filters to fix plausibility issues. Since this is a general issue for RL-based approaches, it does not detract from this contribution.

__ Summary and Contributions__: The paper addresses the task of conditional generation of discrete structures. The authors introduce an idea of scaling the conditional distribution by the reward R(x|y), which re-weights the samples based on the difference between target molecular features and those of the current sample. The authors test their approach on integer problems and molecule generation.
=================================
Added after author response: in the rebuttal the authors explained their motivation more clearly, applicability of other approaches (e.g. RL) and provided a comparison with a competitive baseline by Kang and Cho. They also pointed out the differences with the previous work by Norouzi et al, which was my main concern. This made me change my opinion of the paper, and I am changing my score to 7 (accept).

__ Strengths__: The idea is novel. This approach allows computing the expected reward across the dataset, rather than across samples from the model, avoiding high-variance gradients. The motivation for the approach is sound, and the derivations are easy to follow.

__ Weaknesses__: The authors have not made appropriate comparisons to existing approaches. They compare their approach to only two baselines, both of which are weak. The first is a naive baseline that directly trains the conditional model p(x|y). The second is RAML augmentation, which clearly underperforms (see Table 8). In the "Related work" section, the authors point out that RAML augmentation is applicable only to text, but not to molecules.
I would like to see a comparison to other conditional generative models for molecules. The authors mention several such models in the "Related work" section, e.g. Gómez-Bombarelli et al. and Kang et al. I have listed more examples of conditional models for molecules below.
The approach of constructing the conditional distribution through scaled distances is novel for the area of molecule generation, but is quite similar to Norouzi et al. The relation to Norouzi et al. should be clearly stated. The authors compared their approach to only one baseline, which is training a conditional model p(x|y). No empirical comparison to other related work is provided. Because of that, it is hard to evaluate the usefulness of the approach.

__ Correctness__: The claims and derivations are correct.
In Table 5 it is clear that the features have different scales, and some of them are discrete. The model seems to use the same mean-squared error loss on all the features, so it might mostly optimize for the features with the larger scale. Was any normalization technique applied to preprocess the features and bring them to the same scale?
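To illustrate the concern, here is a minimal sketch of per-feature standardization (zero mean, unit variance) applied before a squared-error comparison. The feature values are made up, and `standardize_columns` and `squared_errors` are hypothetical helpers; I do not know which preprocessing, if any, the authors used.

```python
import statistics


def standardize_columns(rows):
    """Rescale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def squared_errors(pred, target):
    """Per-feature squared error between two property vectors."""
    return [(p - t) ** 2 for p, t in zip(pred, target)]


# Made-up property vectors: a large-scale continuous feature and a small count.
rows = [[300.0, 2.0], [350.0, 3.0], [400.0, 4.0]]

raw = squared_errors(rows[0], rows[1])     # first feature dominates
scaled_rows = standardize_columns(rows)
scaled = squared_errors(scaled_rows[0], scaled_rows[1])  # comparable terms
```

On the raw values the first feature contributes 2500 to the squared error and the second only 1; after standardization both contribute equally in this toy example.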

__ Clarity__: The paper is clearly written and easy to follow.

__ Relation to Prior Work__: The current paper uses a very similar methodology to Norouzi et al., 2016 (which is mentioned in the paper), where the conditional model is scaled by the normalized exponentiated distances between the samples. Adding the entropy term also follows Norouzi et al. Can the authors provide a more detailed explanation of how their work differs from Norouzi et al., besides the application area?
The authors mention several works for conditional generation of molecules and text. There is a variety of models for conditional generation of molecules that are worth mentioning:
Generation of molecules as graphs: De Cao and Kipf, 2018 [1], Li et al. [2], Li et al. [3]
Generation of molecules as SMILES: Lim et al., 2018 [4], Polykovskiy et al. [5], Li et al. [2] (used LSTM baseline)
[1] Nicola De Cao, Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. 2018
[2] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, Peter Battaglia. Learning Deep Generative Models of Graphs. 2018
[3] Yibo Li, Liangren Zhang, Zhenming Liu. Multi-objective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 2018
[4] Jaechang Lim, Seongok Ryu, Jin Woo Kim, Woo Youn Kim. Molecular generative model based on conditional variational autoencoder for de novo molecular design. Journal of Cheminformatics, 2018
[5] Daniil Polykovskiy, Alexander Zhebrak, Dmitry Vetrov, Yan Ivanenkov, Vladimir Aladinskiy, Polina Mamoshina, Marine Bozdaganyan, Alexander Aliper, Alex Zhavoronkov, Artur Kadurin. Entangled Conditional Adversarial Autoencoder for de Novo Drug Discovery. Molecular Pharmaceutics, 2018

__ Reproducibility__: Yes

__ Additional Feedback__: It seems that the description of the proposal distribution q(x | x*, tau) is missing. Can the authors clarify how this distribution was constructed?

__ Summary and Contributions__: This paper studies the problem of goal-directed generation of structured discrete data.
It introduces an approach to directly optimize a reinforcement learning objective that encourages the generation of sequences with specific desired properties.
The experiments show that the model is effective at tackling the problem.

__ Strengths__: The proposed idea is sound, although intuitive.

__ Weaknesses__: Presentation and writing of the paper are obscure, making it hard to follow.
The contribution is not well highlighted.
The proposed model is compared with only one baseline, i.e. ML. I am afraid this may not be the first time RL is adopted for structured discrete data generation tasks, e.g. in NLP.
In this respect, the contribution of this work may not be enough.
The authors may need to further highlight their contribution in the rebuttal.

__ Correctness__: The formulation is sound to me.

__ Clarity__: The presentation and writing of the paper are obscure, making it hard to follow.
The main contribution is not evident from the current presentation.
=====================
I have thoroughly read the author rebuttal. The response addresses my previous concerns regarding the motivation and experimental design:
The primary motivation is to beat MLE with respect to diverse structured data generation and to save computational cost. That is why the authors adopt an RL formulation. The proposed RL model is demonstrated to outperform the compared methods, and it is comparable with other RL methods on the complex data generation tasks in the Appendix.
I have increased my score after the rebuttal.
However, I still think the paper needs to be improved in the final version, e.g. with a detailed motivation statement as in the response [R1: Maximum likelihood (ML) VS RL objectives ]. Otherwise, it is hard to understand the motivation from Lines 33-44. The logic is not very convincing here.

__ Relation to Prior Work__: The work lacks a sufficient discussion of, and comparison with, other RL works for discrete data. Even if they do not target generation with desired properties, they should be briefly discussed to reduce confusion.

__ Reproducibility__: Yes

__ Additional Feedback__: This paper studies the generation of discrete data with specific structured constraints.
It motivates formulating this conditional generation problem in a reinforcement learning setting, namely learning a stochastic policy p_{\theta}(x | y). It further proposes sampling from an approximation to the normalized rewards to avoid the high-variance score-function estimators required by existing RL methods.
In my view, the contribution of this work is not very clear.
Is the main contribution/novelty here the structured *conditional* generation of discrete data? I mean "to generate discrete data *with specific structured constraints*". Are you the first to consider this setting? My confusion is that RL methods have been widely adopted in structured discrete data generation problems, e.g. in NLP [1]. However, the paper does not present any discussion of the connection to, or comparison with, such RL methods.
The authors may need to clearly highlight their motivation in the rebuttal.
[1] Fedus, William, Ian Goodfellow, and Andrew M. Dai. "MaskGAN: Better text generation via filling in the_." arXiv preprint arXiv:1801.07736 (2018).