{"title": "Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 6410, "page_last": 6421, "abstract": "Generating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as chemical valency. However, designing models that find molecules that optimize desired properties while incorporating highly complex and non-differentiable rules remains a challenging task. Here we propose Graph Convolutional Policy Network (GCPN), a general graph convolutional network based model for goal-directed graph generation through reinforcement learning. The model is trained to optimize domain-specific rewards and adversarial loss through policy gradient, and acts in an environment that incorporates domain-specific rules. 
Experimental results show that GCPN can achieve 61% improvement on chemical property optimization over state-of-the-art baselines while resembling known molecules, and achieve 184% improvement on the constrained property optimization task.", "full_text": "Graph Convolutional Policy Network for\n\nGoal-Directed Molecular Graph Generation\n\nJiaxuan You1\u2217\n\njiaxuan@stanford.edu\n\nBowen Liu2\u2217\n\nliubowen@stanford.edu\n\nRex Ying1\n\nrexying@stanford.edu\n\nVijay Pande3\n\npande@stanford.edu\n\nJure Leskovec1\n\njure@cs.stanford.edu\n\n1Department of Computer Science, 2Department of Chemistry, 3Department of Bioengineering\n\nStanford University\nStanford, CA, 94305\n\nAbstract\n\nGenerating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as chemical valency. However, designing models to find molecules that optimize desired properties while incorporating highly complex and non-differentiable rules remains a challenging task. Here we propose Graph Convolutional Policy Network (GCPN), a general graph convolutional network based model for goal-directed graph generation through reinforcement learning. The model is trained to optimize domain-specific rewards and adversarial loss through policy gradient, and acts in an environment that incorporates domain-specific rules. 
Experimental results show that GCPN can achieve 61% improvement on chemical property optimization over state-of-the-art baselines while resembling known molecules, and achieve 184% improvement on the constrained property optimization task.\n\n1 Introduction\n\nMany important problems in drug discovery and material science are based on the principle of designing molecular structures with specific desired properties. However, this remains a challenging task due to the large size of chemical space. For example, the number of drug-like molecules has been estimated to be between 10^23 and 10^60 [32]. Additionally, chemical space is discrete, and molecular properties are highly sensitive to small changes in the molecular structure [21]. An increase in the effectiveness of the design of new molecules with application-driven goals would significantly accelerate developments in novel medicines and materials.\nRecently, there have been significant advances in applying deep learning models to molecule generation [15, 38, 7, 9, 22, 4, 31, 27, 34, 42]. However, the generation of novel and valid molecular graphs that can directly optimize various desired physical, chemical and biological property objectives remains a challenging task, since these property objectives are highly complex [37] and non-differentiable. Furthermore, the generation model should be able to actively explore the vast chemical space, as the distribution of the molecules that possess those desired properties does not necessarily match the distribution of molecules from existing datasets.\n\u2217The first two authors made equal contributions.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nPresent Work. 
In this work, we propose Graph Convolutional Policy Network (GCPN), an approach\nto generate molecules where the generation process can be guided towards speci\ufb01ed desired objectives,\nwhile restricting the output space based on underlying chemical rules. To address the challenge of\ngoal-directed molecule generation, we utilize and extend three ideas, namely graph representation,\nreinforcement learning and adversarial training, and combine them in a single uni\ufb01ed framework.\nGraph representation learning is used to obtain vector representations of the state of generated graphs,\nadversarial loss is used as reward to incorporate prior knowledge speci\ufb01ed by a dataset of example\nmolecules, and the entire model is trained end-to-end in the reinforcement learning framework.\nGraph representation. We represent molecules directly as molecular graphs, which are more robust\nthan intermediate representations such as simpli\ufb01ed molecular-input line-entry system (SMILES)\n[40], a text-based representation that is widely used in previous works [9, 22, 4, 15, 38, 27, 34]. For\nexample, a single character perturbation in a text-based representation of a molecule can lead to\nsigni\ufb01cant changes to the underlying molecular structure or even invalidate it [30]. Additionally,\npartially generated molecular graphs can be interpreted as substructures, whereas partially generated\ntext representations in many cases are not meaningful. As a result, we can perform chemical checks,\nsuch as valency checks, on a partially generated molecule when it is represented as a graph, but not\nwhen it is represented as a text sequence.\nReinforcement learning. A reinforcement learning approach to goal-directed molecule generation\npresents several advantages compared to learning a generative model over a dataset. 
Firstly, desired molecular properties such as drug-likeness [1, 29] and molecule constraints such as valency are complex and non-differentiable, thus they cannot be directly incorporated into the objective function of graph generative models. In contrast, reinforcement learning is capable of directly representing hard constraints and desired properties through the design of environment dynamics and reward function. Secondly, reinforcement learning allows active exploration of the molecule space beyond samples in a dataset. Alternative deep generative model approaches [9, 22, 4, 16] show promising results on reconstructing given molecules, but their exploration ability is restricted by the training dataset.\nAdversarial training. Incorporating prior knowledge specified by a dataset of example molecules is crucial for molecule generation. For example, a drug molecule is usually relatively stable in physiological conditions, non-toxic, and possesses certain physiochemical properties [28]. Although it is possible to hand-code the rules or train a predictor for one of the properties, precisely representing the combination of these properties is extremely challenging. Adversarial training addresses the challenge through a learnable discriminator adversarially trained with a generator [10]. After the training converges, the discriminator implicitly incorporates the information of a given dataset and guides the training of the generator.\nGCPN is designed as a reinforcement learning agent (RL agent) that operates within a chemistry-aware graph generation environment. A molecule is successively constructed by either connecting a new substructure or an atom to the existing molecular graph, or adding a bond to connect existing atoms. GCPN predicts the action of the bond addition, and is trained via policy gradient to optimize a reward composed of molecular property objectives and adversarial loss. 
The adversarial loss is provided by a graph convolutional network [20, 5] based discriminator trained jointly on a dataset of example molecules. Overall, this approach allows direct optimization of application-specific objectives, while ensuring that the generated molecules are realistic and satisfy chemical rules.\nWe evaluate GCPN in three distinct molecule generation tasks that are relevant to drug discovery and materials science: molecule property optimization, property targeting and constrained property optimization. We use the ZINC dataset [14] to provide GCPN with example molecules, and train the policy network to generate molecules with high property score, molecules with a pre-specified range of target property score, or molecules containing a specific substructure while having high property score. In all tasks, GCPN achieves state-of-the-art results. GCPN generates molecules with property scores 61% higher than the best baseline method, and outperforms the baseline models in the constrained optimization setting by 184% on average.\n\n2 Related Work\n\nYang et al. [42] and Olivecrona et al. [31] proposed a recurrent neural network (RNN) SMILES string generator with molecular properties as objective, optimized using Monte Carlo tree search and policy gradient respectively. Guimaraes et al. [27] and Sanchez-Lengeling et al. [34] further added an adversarial loss to the reinforcement learning reward to enforce similarity to a given molecule dataset. In contrast, instead of using a text-based molecular representation, our approach uses a graph-based molecular representation, which leads to many important benefits as discussed in the introduction. Jin et al. [16] proposed to use a variational autoencoder (VAE) framework, where the molecules are represented as junction trees of small clusters of atoms. 
This approach can only indirectly optimize molecular properties in the learned latent embedding space before decoding to a molecule, whereas our approach can directly optimize molecular properties of the molecular graphs. You et al. [43] used an auto-regressive model to maximize the likelihood of the graph generation process, but it cannot be used to generate attributed graphs. Li et al. [25] and Li et al. [26] described sequential graph generation models where conditioning labels can be incorporated to generate molecules whose molecular properties are close to specified target scores. However, these approaches are also unable to directly perform optimization on desired molecular properties. Overall, modeling the goal-directed graph generation task in a reinforcement learning framework is still largely unexplored.\n\n3 Proposed Method\n\nIn this section we formulate the problem of graph generation as learning an RL agent that iteratively adds substructures and edges to the molecular graph in a chemistry-aware environment. We describe the problem definition, the environment design, and the Graph Convolutional Policy Network that predicts a distribution of actions which are used to update the graph being generated.\n\n3.1 Problem Definition\n\nWe represent a graph G as (A, E, F ), where A \u2208 {0, 1}^{n\u00d7n} is the adjacency matrix, and F \u2208 R^{n\u00d7d} is the node feature matrix assuming each node has d features. We define E \u2208 {0, 1}^{b\u00d7n\u00d7n} to be the (discrete) edge-conditioned adjacency tensor, assuming there are b possible edge types. Ei,j,k = 1 if there exists an edge of type i between nodes j and k, and A = \u2211_{i=1}^{b} Ei. 
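As a concrete illustration of this representation, here is a minimal numpy sketch; the 3-node graph, its two edge types, and the one-hot feature matrix are invented for the example:

```python
import numpy as np

n, b = 3, 2  # 3 nodes, 2 edge (bond) types
E = np.zeros((b, n, n), dtype=int)  # edge-conditioned adjacency tensor

# Edge of type 0 between nodes 0 and 1; edge of type 1 between nodes 1 and 2.
E[0, 0, 1] = E[0, 1, 0] = 1
E[1, 1, 2] = E[1, 2, 1] = 1

A = E.sum(axis=0)  # A = sum_i E_i: the plain adjacency matrix
F = np.eye(n)      # toy node feature matrix (one-hot "atom types")
```

Summing over the edge-type axis recovers the ordinary adjacency matrix, which is exactly the A = \u2211_i Ei relation above.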
Our primary objective is to generate graphs that maximize a given property function S(G) \u2208 R, i.e., maximize EG\u2032 [S(G\u2032)], where G\u2032 is the generated graph, and S could be one or multiple domain-specific statistics of interest. It is also of practical importance to constrain our model with two main sources of prior knowledge. (1) Generated graphs need to satisfy a set of hard constraints. (2) We provide the model with a set of example graphs G \u223c pdata(G), and would like to incorporate such prior knowledge by regularizing the property optimization objective with EG,G\u2032 [J(G, G\u2032)] under distance metric J(\u00b7, \u00b7). In the case of molecule generation, the set of hard constraints is described by chemical valency while the distance metric is an adversarially trained discriminator.\n\nFigure 1: An overview of the proposed iterative graph generation method. Each row corresponds to one step in the generation process. (a) The state is defined as the intermediate graph Gt, and the set of scaffold subgraphs defined as C is appended for GCPN calculation. (b) GCPN conducts message passing to encode the state as node embeddings then produce a policy \u03c0\u03b8. (c) An action at with 4 components is sampled from the policy. (d) The environment performs a chemical valency check on the intermediate state, and then returns (e) the next state Gt+1 and (f) the associated reward rt.\n\n3.2 Graph Generation as Markov Decision Process\n\nA key task for building our model is to specify a generation procedure. 
We designed an iterative graph generation process and formulated it as a general decision process M = (S, A, P, R, \u03b3), where S = {si} is the set of states that consists of all possible intermediate and final graphs, A = {ai} is the set of actions that describe the modification made to the current graph at each time step, P is the transition dynamics that specifies the possible outcomes of carrying out an action, p(st+1|st, ..., s0, at), R(st) is a reward function that specifies the reward after reaching state st, and \u03b3 is the discount factor. The procedure to generate a graph can then be described by a trajectory (s0, a0, r0, ..., sn, an, rn), where sn is the final generated graph. The modification of a graph at each time step can be viewed as a state transition distribution: p(st+1|st, ..., s0) = \u2211_{at} p(at|st, ..., s0) p(st+1|st, ..., s0, at), where p(at|st, ..., s0) is usually represented as a policy network \u03c0\u03b8.\nRecent graph generation models add nodes and edges based on the full trajectory (st, ..., s0) of the graph generation procedure [43, 25] using recurrent units, which tend to \u201cforget\u201d initial steps of generation quickly. In contrast, we design a graph generation procedure that can be formulated as a Markov Decision Process (MDP), which requires the state transition dynamics to satisfy the Markov property: p(st+1|st, ..., s0) = p(st+1|st). Under this property, the policy network only needs the intermediate graph state st to derive an action (see Section 3.4). The action is used by the environment to update the intermediate graph being generated (see Section 3.3).\n\n3.3 Molecule Generation Environment\n\nIn this section we discuss the setup of the molecule generation environment. On a high level, the environment builds up a molecular graph step by step through a sequence of bond or substructure addition actions given by GCPN. 
Figure 1 illustrates the 5 main components that come into play in each step, namely state representation, policy network, action, state transition dynamics and reward. Note that this environment can be easily extended to graph generation tasks in other settings.\nState Space. We define the state of the environment st as the intermediate generated graph Gt at time step t, which is fully observable by the RL agent. Figure 1 (a)(e) depicts the partially generated molecule state before and after an action is taken. At the start of generation, we assume G0 contains a single node that represents a carbon atom.\nAction Space. In our framework, we define a distinct, fixed-dimension and homogeneous action space amenable to reinforcement learning. We design an action analogous to link prediction, which is a well-studied problem in network science. We first define a set of scaffold subgraphs {C1, . . . , Cs} to be added during graph generation, and the collection is defined as C = \u222a_{i=1}^{s} Ci. Given a graph Gt at step t, we define the corresponding extended graph as Gt \u222a C. Under this definition, an action can either correspond to connecting a new subgraph Ci to a node in Gt or connecting existing nodes within graph Gt. Once an action is taken, the remaining disconnected scaffold subgraphs are removed. In our implementation, we adopt the most fine-grained version where C consists of all b different single node graphs, where b denotes the number of different atom types. Note that C can be extended to contain molecule substructure scaffolds [16], which allows specification of preferred substructures at the cost of model flexibility. In Figure 1(b), a link is predicted between the green and yellow atoms. We will discuss the link prediction algorithm in detail in Section 3.4.\nState Transition Dynamics. Domain-specific rules are incorporated in the state transition dynamics. The environment carries out actions that obey the given rules. 
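The extended-graph action space can be sketched in a few lines; `apply_action`, the node indexing, and the toy sizes below are hypothetical stand-ins, not the paper's implementation:

```python
# Nodes 0..n-1 belong to the current graph G_t; nodes n..n+c-1 are the
# single-atom scaffold nodes C appended for the action computation.
n, c = 4, 3  # current-graph size and scaffold count (toy values)

def apply_action(edges, action):
    """Apply an (a_first, a_second, a_edge, a_stop) tuple to an edge list.

    `a_first` must index a node in G_t; `a_second` may index a scaffold
    node, which corresponds to attaching a new atom to the molecule.
    """
    first, second, edge_type, stop = action
    assert 0 <= first < n        # first node is always inside G_t
    assert 0 <= second < n + c   # second node may be a scaffold atom
    if not stop:
        edges = edges + [(first, second, edge_type)]
    return edges, bool(stop)

# Attach scaffold atom 5 to node 1 of G_t with bond type 2.
edges, done = apply_action([(0, 1, 0)], (1, 5, 2, 0))
```

After such a step, any scaffold node that became connected would be merged into G_t and the remaining disconnected scaffold nodes discarded, as described above.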
Infeasible actions proposed by the policy network are rejected and the state remains unchanged. For the task of molecule generation, the environment incorporates rules of chemistry. In Figure 1(d), both actions pass the valency check, and the environment updates the (partial) molecule according to the actions. Note that unlike a text-based representation, the graph-based molecular representation enables us to perform this step-wise valency check, as it can be conducted even for incomplete molecular graphs.\nReward design. Both intermediate rewards and final rewards are used to guide the behaviour of the RL agent. We define the final rewards as a sum over domain-specific rewards and adversarial rewards. The domain-specific rewards consist of the (combination of) final property scores, such as octanol-water partition coefficient (logP), druglikeness (QED) [1] and molecular weight (MW). Domain-specific rewards also include penalization of unrealistic molecules according to various criteria, such as excessive steric strain and the presence of functional groups that violate ZINC functional group filters [14]. The intermediate rewards include step-wise validity rewards and adversarial rewards. A small positive reward is assigned if the action does not violate valency rules, otherwise a small negative reward is assigned. As an example, the second row of Figure 1 shows the scenario that a termination action is taken. 
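A step-wise valency check of this kind can be sketched as a simple valence-capacity test; this is a deliberately simplified stand-in (the paper's environment uses RDKit for the real chemistry, and the capacities below cover only a few elements):

```python
# Simplified step-wise valency check: adding a bond is feasible only if
# both endpoints still have spare valence after counting existing bonds.
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}  # illustrative capacities only

def bond_is_feasible(atoms, bonds, u, v, order):
    """atoms: node -> element symbol; bonds: list of (u, v, bond_order)."""
    def used(node):
        # Total bond order already consumed at `node`.
        return sum(o for a, b, o in bonds if node in (a, b))
    return (used(u) + order <= MAX_VALENCE[atoms[u]]
            and used(v) + order <= MAX_VALENCE[atoms[v]])

atoms = {0: "C", 1: "O"}
bonds = [(0, 1, 2)]                          # C=O double bond present
ok = bond_is_feasible(atoms, bonds, 0, 1, 1)  # would push O past valence 2
```

Because the check only needs the partial graph, it can run after every proposed action, which is exactly what a string representation cannot support mid-generation.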
When the environment updates according to a terminating action, both a step reward and a final reward are given, and the generation process terminates.\nTo ensure that the generated molecules resemble a given set of molecules, we employ the Generative Adversarial Network (GAN) framework [10] to define the adversarial rewards V (\u03c0\u03b8, D\u03c6):\n\nmin\u03b8 max\u03c6 V (\u03c0\u03b8, D\u03c6) = Ex\u223cpdata [log D\u03c6(x)] + Ex\u223c\u03c0\u03b8 [log(1 \u2212 D\u03c6(x))]    (1)\n\nwhere \u03c0\u03b8 is the policy network, D\u03c6 is the discriminator network, x represents an input graph, and pdata is the underlying data distribution, which is defined either over final graphs (for final rewards) or over intermediate graphs (for intermediate rewards). However, only D\u03c6 can be trained with stochastic gradient descent, as x is a graph object that is non-differentiable with respect to the policy parameters \u03b8. Instead, we use \u2212V (\u03c0\u03b8, D\u03c6) as an additional reward together with other rewards, and optimize the total rewards with policy gradient methods [44] (Section 3.5). The discriminator network employs the same structure as the policy network (Section 3.4) to calculate the node embeddings, which are then aggregated into a graph embedding and cast into a scalar prediction.\n\n3.4 Graph Convolutional Policy Network\n\nHaving illustrated the graph generation environment, we outline the architecture of GCPN, a policy network learned by the RL agent to act in the environment. GCPN takes the intermediate graph Gt and the collection of scaffold subgraphs C as inputs, and outputs the action at, which predicts a new link to be added, as described in Section 3.3.\nComputing node embeddings. 
In order to perform link prediction in Gt \u222a C, our model first computes the node embeddings of an input graph using Graph Convolutional Networks (GCN) [20, 5, 18, 36, 8], a well-studied technique that achieves state-of-the-art performance in representation learning for molecules. We use the following variant that supports the incorporation of categorical edge types. The high-level idea is to perform message passing over each edge type for a total of L layers. At the lth layer of the GCN, we aggregate all messages from different edge types to compute the next layer node embedding H(l+1) \u2208 R^{(n+c)\u00d7k}, where n, c are the sizes of Gt and C respectively, and k is the embedding dimension. More concretely,\n\nH(l+1) = AGG(ReLU({\u02dcDi^{\u22121/2} \u02dcEi \u02dcDi^{\u22121/2} H(l) Wi(l)}, \u2200i \u2208 (1, ..., b)))    (2)\n\nwhere Ei is the ith slice of the edge-conditioned adjacency tensor E, \u02dcEi = Ei + I, and \u02dcDi is the diagonal degree matrix with entries \u02dcDi(j, j) = \u2211_k \u02dcEi,j,k. Wi(l) is a trainable weight matrix for the ith edge type, and H(l) is the node representation learned in the lth layer. We use AGG(\u00b7) to denote an aggregation function that could be one of {MEAN, MAX, SUM, CONCAT} [12]. This variant of the GCN layer allows for parallel implementation while remaining expressive for aggregating information among different edge types. We apply an L layer GCN to the extended graph Gt \u222a C to compute the final node embedding matrix X = H(L).\nAction prediction. The link prediction based action at at time step t is a concatenation of four components: selection of two nodes, prediction of edge type, and prediction of termination. 
Concretely, each component is sampled according to a predicted distribution governed by Equations 3 and 4:\n\nat = CONCAT(afirst, asecond, aedge, astop)    (3)\n\nffirst(st) = SOFTMAX(mf (X)),  afirst \u223c ffirst(st) \u2208 {0, 1}^n\nfsecond(st) = SOFTMAX(ms(Xafirst, X)),  asecond \u223c fsecond(st) \u2208 {0, 1}^{n+c}\nfedge(st) = SOFTMAX(me(Xafirst, Xasecond)),  aedge \u223c fedge(st) \u2208 {0, 1}^b\nfstop(st) = SOFTMAX(mt(AGG(X))),  astop \u223c fstop(st) \u2208 {0, 1}    (4)\n\nWe use mf to denote a Multilayer Perceptron (MLP) that maps X0:n \u2208 R^{n\u00d7k} to an R^n vector, which represents the probability distribution of selecting each node. The information from the first selected node afirst is incorporated in the selection of the second node by concatenating its embedding Xafirst with that of each node in Gt \u222a C. The second MLP ms then maps the concatenated embedding to the probability distribution of each potential node to be selected as the second node. Note that when selecting two nodes to predict a link, the first node to select, afirst, should always belong to the currently generated graph Gt, whereas the second node to select, asecond, can be either from Gt (forming a cycle) or from C (adding a new substructure). To predict a link, me takes Xafirst and Xasecond as inputs and maps them to a categorical edge type using an MLP. Finally, the termination probability is computed by first aggregating the node embeddings into a graph embedding using an aggregation function AGG, and then mapping the graph embedding to a scalar using an MLP mt.\n\n3.5 Policy Gradient Training\n\nPolicy gradient based methods are widely adopted for optimizing policy networks. Here we adopt Proximal Policy Optimization (PPO) [35], one of the state-of-the-art policy gradient methods. 
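A minimal numpy sketch of this forward pass may help; the graph sizes are toys and the MLPs mf and ms are replaced by single random projections, so this only traces the shapes and sampling structure of Equations 2-4, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(E, H, W):
    """One edge-type-aware GCN layer (Eq. 2), aggregated with SUM."""
    b, num_nodes, _ = E.shape
    out = np.zeros((num_nodes, W.shape[-1]))
    for i in range(b):
        Ei = E[i] + np.eye(num_nodes)               # ~E_i = E_i + I
        Di = np.diag(1.0 / np.sqrt(Ei.sum(1)))      # ~D_i^{-1/2}
        out += np.maximum(Di @ Ei @ Di @ H @ W[i], 0.0)  # ReLU, then SUM
    return out

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy extended graph G_t U C: n + c nodes, b edge types, k-dim embeddings.
n, c, b, k = 3, 2, 2, 8
E = np.zeros((b, n + c, n + c))
E[0, 0, 1] = E[0, 1, 0] = 1
X = gcn_layer(E, rng.standard_normal((n + c, k)), rng.standard_normal((b, k, k)))

# Eq. 3-4, with random projections standing in for the MLPs m_f and m_s.
p_first = softmax(X[:n] @ rng.standard_normal(k))       # over G_t only
a_first = int(p_first.argmax())
pair = np.concatenate([np.repeat(X[a_first][None], n + c, 0), X], axis=1)
p_second = softmax(pair @ rng.standard_normal(2 * k))   # over G_t U C
```

Note how `p_first` is restricted to the first n rows (the current graph), while `p_second` ranges over all n + c rows, matching the constraint that only the second node may come from the scaffold set C.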
The objective function of PPO is defined as follows:\n\nmax LCLIP(\u03b8) = Et[min(rt(\u03b8) \u02c6At, clip(rt(\u03b8), 1 \u2212 \u03f5, 1 + \u03f5) \u02c6At)],  rt(\u03b8) = \u03c0\u03b8(at|st) / \u03c0\u03b8old(at|st)    (5)\n\nwhere rt(\u03b8) is the probability ratio that is clipped to the range of [1 \u2212 \u03f5, 1 + \u03f5], making LCLIP(\u03b8) a lower bound of the conservative policy iteration objective [17], and \u02c6At is the estimated advantage function, which involves a learned value function V\u03c9(\u00b7) to reduce the variance of the estimate. In GCPN, V\u03c9(\u00b7) is an MLP that maps the graph embedding computed according to Section 3.4 to a scalar.\nIt is known that pretraining a policy network with expert policies, if they are available, leads to better training stability and performance [24]. In our setting, any ground truth molecule can be viewed as an expert trajectory for pretraining GCPN. This expert imitation objective can be written as min LEXPERT(\u03b8) = \u2212 log(\u03c0\u03b8(at|st)), where (st, at) pairs are obtained from ground truth molecules. Specifically, given a molecule dataset, we randomly sample a molecular graph G, and randomly select one connected subgraph G\u2032 of G as the state st. At state st, any action that adds an atom or bond in G \\ G\u2032 can be taken in order to generate the sampled molecule. Hence, we randomly sample at \u2208 G \\ G\u2032, and use the pair (st, at) to supervise the expert imitation objective.\n\n4 Experiments\n\nTo demonstrate the effectiveness of goal-directed search for molecules with desired properties, we compare our method with state-of-the-art molecule generation algorithms in the following tasks.\nProperty Optimization. The task is to generate novel molecules whose specified molecular properties are optimized. 
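The clipped surrogate in Equation 5 can be evaluated directly on a batch of probability ratios and advantage estimates; the three sample values below are invented for illustration:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """L^CLIP from Eq. 5: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(ratio * adv, clipped).mean()

ratio = np.array([0.9, 1.0, 1.5])   # pi_theta / pi_theta_old per sample
adv = np.array([1.0, -1.0, 2.0])    # advantage estimates A_hat
L = ppo_clip_objective(ratio, adv)  # = mean(0.9, -1.0, 2.4)
```

For the third sample the ratio 1.5 is clipped to 1.2 before multiplying the positive advantage, which is what keeps the surrogate a pessimistic (lower) bound and discourages overly large policy updates.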
This can be useful in many applications such as drug discovery and materials science, where the goal is to identify molecules with highly optimized properties of interest.\nProperty Targeting. The task is to generate novel molecules whose specified molecular properties are as close to the target scores as possible. This is crucial in generating virtual libraries of molecules with properties that are generally suitable for a desired application. For example, a virtual molecule library for drug discovery should have high drug-likeness and synthesizability.\nConstrained Property Optimization. The task is to generate novel molecules whose specified molecular properties are optimized, while also containing a specified molecular substructure. This can be useful in lead optimization problems in drug discovery and materials science, where we want to make modifications to a promising lead molecule and improve its properties [2].\n\n4.1 Experimental Setup\n\nWe outline our experimental setup in this section. Further details are provided in the appendix1.\nDataset. For the molecule generation experiments, we utilize the ZINC250k molecule dataset [14] that contains 250,000 drug-like commercially available molecules whose maximum atom number is 38. We use the dataset for both expert pretraining and adversarial training.\nMolecule environment. We set up the molecule environment as an OpenAI Gym environment [3] using RDKit [23] and adapt it to the ZINC250k dataset. 
1Link to code and datasets: https://github.com/bowenliu16/rl_graph_generation\n\nTable 1: Comparison of the top 3 property scores of generated molecules found by each model.\n\nMethod          Penalized logP                      QED\n                1st    2nd    3rd    Validity       1st    2nd    3rd    Validity\nZINC            4.52   4.30   4.23   100.0%         0.948  0.948  0.948  100.0%\nHill Climbing   \u2212      \u2212      \u2212      \u2212              0.838  0.814  0.814  100.0%\nORGAN           3.63   3.49   3.44   0.4%           0.896  0.824  0.820  2.2%\nJT-VAE          5.30   4.93   4.49   100.0%         0.925  0.911  0.910  100.0%\nGCPN            7.98   7.85   7.80   100.0%         0.948  0.947  0.946  100.0%\n\nSpecifically, the maximum atom number is set to be 38. There are 9 atom types and 3 edge types, as molecules are represented in kekulized form. For the reward design, we linearly scale each reward component according to its importance in molecule generation from a chemistry point of view as well as the quality of generation results. When summing up all the rewards collected from a molecule generation trajectory, the range of the reward value that the model can get is [\u22124, 4] for the final chemical property reward, [\u22122, 2] for the final chemical filter reward, [\u22121, 1] for the final adversarial reward, [\u22121, 1] for the intermediate adversarial reward and [\u22121, 1] for the intermediate validity reward.\nGCPN Setup. We use a 3-layer GCPN as the policy network with 64-dimensional node embeddings in all hidden layers, and batch normalization [13] is applied after each layer. Another 3-layer GCN with the same architecture is used for discriminator training. We find little improvement when further adding GCN layers. We observe comparable performance among different aggregation functions and select SUM(\u00b7) for all experiments. We found both the expert pretraining and RL objectives important for generating high quality molecules, thus both of them are kept throughout training. 
Specifically, we use the PPO algorithm to train the RL objective with the default hyperparameters, as we do not observe much performance gain from tuning them, and the learning rate is set to 0.001. The expert pretraining objective is trained with a learning rate of 0.00025, and we do observe that adding this objective contributes to faster convergence and better performance. Both objectives are trained using the Adam optimizer [19] with batch size 32.\nBaselines. We compare our method with the following state-of-the-art baselines. Junction Tree VAE (JT-VAE) [16] is a state-of-the-art algorithm that combines graph representation and a VAE framework for generating molecular graphs, and uses Bayesian optimization over the learned latent space to search for molecules with optimized property scores. JT-VAE has been shown to outperform previous deep generative models for molecule generation, including Character-VAE [9], Grammar-VAE [22], SD-VAE [4] and GraphVAE [39]. We also compare our approach with ORGAN [27], a state-of-the-art RL-based molecule generation algorithm using a text-based representation of molecules. To demonstrate the benefits of learning-based approaches, we further implement a simple rule-based model using the stochastic hill-climbing algorithm. We start with a graph containing a single atom (the same setting as GCPN), traverse all valid actions given the current state, randomly pick the next state from the 5 highest-scoring candidates as long as there is improvement over the current state, and loop until reaching the maximum number of nodes. To make a fair comparison across different methods, we set up the same objective functions for all methods, and run all the experiments on the same computing facilities using 32 CPU cores. 
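The hill-climbing baseline described above can be sketched generically; the integer toy search problem below stands in for the molecule state space, and `neighbors`/`score` are hypothetical stand-ins for valid graph actions and the property function:

```python
import random

def stochastic_hill_climb(init, neighbors, score, max_steps=50, top_k=5, seed=0):
    """Greedy-with-randomness search: among the improving neighbors,
    pick one of the top_k highest-scoring at random; stop when no
    neighbor improves on the current state or max_steps is reached."""
    rng = random.Random(seed)
    state = init
    for _ in range(max_steps):
        improving = [s for s in neighbors(state) if score(s) > score(state)]
        if not improving:
            break
        top = sorted(improving, key=score, reverse=True)[:top_k]
        state = rng.choice(top)
    return state

# Toy problem: maximize -(x - 7)^2 over integers via +/-1 moves.
best = stochastic_hill_climb(0, lambda x: [x - 1, x + 1],
                             lambda x: -(x - 7) ** 2)  # climbs to 7
```

Like the baseline in the paper, this explores only local moves and terminates as soon as no action improves the score, which is why it is prone to getting stuck at locally optimal molecules.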
We run both deep learning baselines using their released code and allow the baselines a wall-clock running time of roughly 24 hours, while our model can get the results in roughly 8 hours.\n\nTable 2: Comparison of the effectiveness of the property targeting task.\n\nMethod   \u22122.5 \u2264 logP \u2264 \u22122     5 \u2264 logP \u2264 5.5       150 \u2264 MW \u2264 200       500 \u2264 MW \u2264 550\n         Success  Diversity    Success  Diversity   Success  Diversity   Success  Diversity\nZINC     0.3%     0.919        1.3%     0.909       1.7%     0.938       0        \u2212\nJT-VAE   11.3%    0.846        7.6%     0.907       0.7%     0.824       16.0%    0.898\nORGAN    0        \u2212            0.2%     0.909       15.1%    0.759       0.1%     0.907\nGCPN     85.5%    0.392        54.7%    0.855       76.1%    0.921       74.1%    0.920\n\nTable 3: Comparison of the performance in the constrained optimization task.\n\n\u03b4     JT-VAE                                        GCPN\n      Improvement   Similarity    Success         Improvement   Similarity    Success\n0.0   1.91 \u00b1 2.04   0.28 \u00b1 0.15   97.5%           4.20 \u00b1 1.28   0.32 \u00b1 0.12   100.0%\n0.2   1.68 \u00b1 1.85   0.33 \u00b1 0.13   97.1%           4.12 \u00b1 1.19   0.34 \u00b1 0.11   100.0%\n0.4   0.84 \u00b1 1.45   0.51 \u00b1 0.10   83.6%           2.49 \u00b1 1.30   0.47 \u00b1 0.08   100.0%\n0.6   0.21 \u00b1 0.71   0.69 \u00b1 0.06   46.4%           0.79 \u00b1 0.63   0.68 \u00b1 0.08   100.0%\n\n4.2 Molecule Generation Results\n\nProperty optimization. In this task, we focus on generating molecules with the highest possible penalized logP [22] and QED [1] scores. Penalized logP is a logP score that also accounts for ring size and synthetic accessibility [6], while QED is an indicator of drug-likeness. Note that both scores are calculated from empirical prediction models whose parameters are estimated from related datasets [41, 1], and these scores are widely used in previous molecule generation papers [9, 22, 4, 39, 27]. Penalized logP has an unbounded range, while QED has a range of [0, 1] by definition, thus directly comparing the percentage improvement of QED may not be meaningful. We adopt the same evaluation method as previous approaches [22, 4, 16], reporting the best 3 property scores found by 
0.79 \u00b1 0.63\n\nGCPN\n\nSimilarity\n0.32 \u00b1 0.12\n0.34 \u00b1 0.11\n0.47 \u00b1 0.08\n0.68 \u00b1 0.08\n\nSuccess\n\n100.0%\n100.0%\n100.0%\n100.0%\n\neach model and the fraction of molecules that satisfy chemical validity. Table 1 summarizes the best\nproperty scores of molecules found by each model, and the statistics in ZINC250k is also shown\nfor comparison. Our method consistently performs signi\ufb01cantly better than previous methods when\noptimizing penalized logP, achieving an average improvement of 61% compared to JT-VAE, and\n186% compared to ORGAN. Our method outperforms all the baselines in the QED optimization task\nas well, and signi\ufb01cantly outperforms the stochastic hill climbing baseline.\nCompared with ORGAN, our model can achieve a perfect validity ratio due to the molecular graph\nrepresentation that allows for step-wise chemical valency check. Compared to JT-VAE, our model\ncan reach a much higher score owing to the fact that RL allows for direct optimization of a given\nproperty score and is able to easily extrapolate beyond the given dataset. Visualizations of generated\nmolecules with optimized logP and QED scores are displayed in Figure 2(a) and (b) respectively.\nAlthough most generated molecules are realistic, in some very rare cases, especially where we reduce\nthe of the adversarial reward and expert pretraining components, our method can generate undesirable\nmolecules with astonishingly high penalized logP predicted by the empirical model, such as the one\non the bottom-right of Figure 2(a) in which our method correctly identi\ufb01ed that Iodine has the highest\nper atom contribution in the empirical model used to calculate logP. These undesirable molecules\nare likely to have inaccurate predicted properties and illustrate an issue with optimizing properties\ncalculated by an empirical model, such as penalized logP and QED, without incorporating prior\nknowledge. 
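For reference, the penalized logP objective discussed above is commonly written, as in prior work [22, 16], as a combination of three components, each standardized (zero mean, unit variance) over the training set; exact normalization constants vary across implementations:

```latex
\text{penalized\_logP}(m) \;=\; \widehat{\log P}(m) \;-\; \widehat{\mathrm{SA}}(m) \;-\; \widehat{\mathrm{cycle}}(m)
```

Here $\widehat{\log P}(m)$ is the standardized octanol-water partition coefficient [41], $\widehat{\mathrm{SA}}(m)$ the standardized synthetic accessibility score [6], and $\widehat{\mathrm{cycle}}(m)$ a standardized penalty on rings with more than six atoms.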
Empirical models that predict molecular properties generalize poorly to molecules that differ significantly from the set of molecules used to train them. Without any restrictions on the generated molecules, an optimization algorithm will exploit this lack of generalizability in certain areas of molecule space. Our model addresses this issue by incorporating prior knowledge of known realistic molecules through adversarial training and expert pretraining, which results in more realistic molecules, though with lower property scores as calculated by the empirical prediction models. Note that the hill-climbing baseline mostly generates such undesirable cases, where the accuracy of the empirical prediction model is questionable; thus its performance on optimizing penalized logP is not listed in Table 1.
Property Targeting. In this task, we specify a target range for molecular weight (MW) and logP, and report the percentage of generated molecules with property scores within the range, as well as the diversity of the generated molecules. The diversity of a set of molecules is defined as the average pairwise Tanimoto distance between the Morgan fingerprints [33] of the molecules. The RL reward for this task is derived from the L1 distance between the property score of a generated molecule and the range center. To increase the difficulty, we set the target ranges such that few molecules in the ZINC250k dataset fall within them, testing each method's ability to extrapolate toward a given target. The target ranges are −2.5 ≤ logP ≤ −2, 5 ≤ logP ≤ 5.5, 150 ≤ MW ≤ 200 and 500 ≤ MW ≤ 550.

Figure 2: Samples of generated molecules in the property optimization and constrained property optimization tasks.
In (c), the two columns correspond to molecules before and after modification.

As shown in Table 2, GCPN has a significantly higher success rate in generating molecules with properties within the target range, compared to the baseline methods. In addition, GCPN is able to generate molecules with high diversity, indicating that it learns a general stochastic policy for generating molecular graphs that fulfill the property requirements.
Constrained Property Optimization. In this experiment, we optimize the penalized logP while constraining the generated molecules to contain one of the 800 ZINC molecules with low penalized logP, following the evaluation in JT-VAE. Since JT-VAE cannot constrain the generated molecule to contain a given substructure, we adopt their evaluation method, in which the constraint is relaxed so that the molecule similarity sim(G, G′) between the original and modified molecules is above a threshold δ. We train a fixed GCPN in an environment whose initial state is randomly set to be one of the 800 ZINC molecules, then conduct the same training procedure as in the property optimization task. Over the 800 molecules, the mean and standard deviation of the highest property score improvement, and the corresponding similarity between the original and modified molecules, are reported in Table 3. Our model significantly outperforms JT-VAE, with 184% higher penalized logP improvement on average, and it consistently succeeds in discovering molecules with higher logP scores. Also note that JT-VAE performs optimization steps separately for each given molecule constraint. In contrast, GCPN generalizes well: it learns a single general policy to improve property scores and applies the same policy to all 800 molecules.
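Both the similarity constraint sim(G, G′) ≥ δ and the diversity metric reduce to Tanimoto similarity over fingerprint bit sets. A minimal sketch follows; the plain Python sets here are a hypothetical stand-in for the Morgan fingerprint bits [33] that the paper computes with RDKit:

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def diversity(fingerprints):
    """Average pairwise Tanimoto distance (1 - similarity), as used for
    the diversity metric reported in Table 2."""
    pairs = list(combinations(fingerprints, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

def satisfies_constraint(fp_orig, fp_new, delta):
    """Relaxed structural constraint sim(G, G') >= delta from the
    constrained property optimization task."""
    return tanimoto(fp_orig, fp_new) >= delta

# Toy fingerprints: the first two share bits {2, 3}; the third shares none.
fps = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
d = diversity(fps)   # pairwise distances: 1 - 2/4, 1 - 0/5, 1 - 0/5
ok = satisfies_constraint({1, 2, 3}, {2, 3, 4}, delta=0.4)
```

With these toy fingerprints, the diversity is (0.5 + 1.0 + 1.0) / 3 ≈ 0.833, and the constraint check passes because the similarity 0.5 exceeds δ = 0.4.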
Figure 2(c) shows that GCPN can modify ZINC molecules to achieve high penalized logP scores while still containing the substructure of the original molecule.

5 Conclusion

We introduced GCPN, a graph generation policy network that uses a graph state representation and adversarial training, and applied it to the task of goal-directed molecular graph generation. GCPN consistently outperforms other state-of-the-art approaches on the tasks of molecular property optimization and targeting while maintaining 100% validity and resemblance to realistic molecules. Furthermore, the application of GCPN can extend well beyond molecule generation: the algorithm can be applied to generate graphs in many other contexts, such as electric circuits and social networks, and to explore graphs that optimize domain-specific properties.

6 Acknowledgements

The authors thank Xiang Ren, Marinka Zitnik, Jiaming Song, Joseph Gomes, Amir Barati Farimani, Peter Eastman, Franklin Lee, Zhenqin Wu and Paul Wender for their helpful discussions. This research has been supported in part by DARPA SIMPLEX, ARO MURI, Stanford Data Science Initiative, Huawei, JD, and Chan Zuckerberg Biohub. The Pande Group acknowledges the generous support of Dr. Anders G. Frøseth and Mr. Christian Sundt for our work on machine learning. The Pande Group is broadly supported by grants from the NIH (R01 GM062868 and U19 AI109662) as well as gift funds and contributions from Folding@home donors.
V.S.P. is a consultant and SAB member of Schrodinger, LLC and Globavir; sits on the Board of Directors of Apeel Inc, Asimov Inc, BioAge Labs, Freenome Inc, Omada Health, Patient Ping, and Rigetti Computing; and is a General Partner at Andreessen Horowitz.

7 Appendix

Validity. We define a molecule as valid if it passes the sanitization checks in RDKit.
Valency.
This specifies the chemically allowable node degrees for an atom of a particular element; some elements can have multiple possible valencies. At each intermediate step, the molecule environment checks that no atom in the partially completed graph exceeds the maximum possible valency for its element type.
Steric strain filter. Valid molecules can still be unrealistic. In particular, it is possible to generate valid molecules that are so sterically strained that they are unlikely to be stable under ordinary conditions. We designed a simple steric strain filter that performs an MMFF94 force field [11] minimization on a molecule and then penalizes the molecule as too sterically strained if the average angle-bend energy exceeds a cutoff of 0.82 kcal/mol.
Reactive functional group filter. We also penalize molecules that possess known problematic and reactive functional groups. For simplicity, we use the same set of rules that was used in the construction of the ZINC dataset, as implemented in RDKit.
Reward design implementation. For the property optimization task, we use linear functions to roughly map the minimum and maximum property scores of the ZINC dataset into the desired reward range. For the property targeting task, we use linear functions to map the absolute difference between the target and the property score into the desired reward range. We threshold the reward so that it does not exceed the desired reward range, as described in Section 4.1. For specific parameters, please refer to the open-source code: https://github.com/bowenliu16/rl_graph_generation

References

[1] G. R. Bickerton, G. V. Paolini, J. Besnard, S. Muresan, and A. L. Hopkins. Quantifying the chemical beauty of drugs. Nature Chemistry, 4(2):90, 2012.
[2] K. H. Bleicher, H.-J. Böhm, K. Müller, and A. I. Alanine. Hit and lead generation: beyond high-throughput screening.
Nature Reviews Drug Discovery, 2:369–378, 2003.
[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. CoRR, abs/1606.01540, 2016.
[4] H. Dai, Y. Tian, B. Dai, S. Skiena, and L. Song. Syntax-directed variational autoencoder for structured data. arXiv preprint arXiv:1802.08786, 2018.
[5] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, 2015.
[6] P. Ertl. Estimation of synthetic accessibility score of drug-like molecules. Journal of Cheminformatics, 2009.
[7] P. Ertl, R. Lewis, E. J. Martin, and V. Polyakov. In silico generation of novel, drug-like chemical matter using the LSTM neural network. CoRR, abs/1712.07449, 2017.
[8] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry, 2017.
[9] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016.
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[11] T. A. Halgren. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry, 17(5-6):490–519, 1996.
[12] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017.
[13] S. Ioffe and C. Szegedy.
Batch normalization: Accelerating deep network training by reducing internal covariate shift. In F. Bach and D. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
[14] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
[15] E. Jannik Bjerrum and R. Threlfall. Molecular generation with recurrent neural networks (RNNs). arXiv preprint arXiv:1705.04612, 2017.
[16] W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364, 2018.
[17] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002.
[18] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30:595–608, Aug. 2016.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[20] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2016.
[21] P. Kirkpatrick and C. Ellis. Chemical space. Nature, 432:823, Dec 2004.
[22] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. In D. Precup and Y. W. Teh, editors, International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[23] G. Landrum. RDKit: Open-source cheminformatics, 2006.
[24] S. Levine and V. Koltun. Guided policy search.
In International Conference on Machine Learning, 2013.
[25] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
[26] Y. Li, L. Zhang, and Z. Liu. Multi-objective de novo drug design with conditional graph generative model. arXiv e-prints, Jan. 2018.
[27] G. Lima Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P. L. Cunha Farias, and A. Aspuru-Guzik. Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv e-prints, May 2017.
[28] J. H. Lin and A. Y. H. Lu. Role of pharmacokinetics and metabolism in drug discovery and development. Pharmacological Reviews, 49(4):403–449, 1997.
[29] C. A. Lipinski, F. Lombardo, B. W. Dominy, and P. J. Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews, 23(1):3–25, 1997.
[30] B. Liu, B. Ramsundar, P. Kawthekar, J. Shi, J. Gomes, Q. Luu Nguyen, S. Ho, J. Sloane, P. Wender, and V. Pande. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10):1103–1113, 2017.
[31] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48, Sep 2017.
[32] P. G. Polishchuk, T. I. Madzhidov, and A. Varnek. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design, 27(8):675–679, Aug 2013.
[33] D. Rogers and M. Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.
[34] B. Sanchez-Lengeling, C. Outeiral, G. L. Guimaraes, and A. Aspuru-Guzik. Optimizing distributions over molecular space.
An objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC). ChemRxiv e-prints, Aug 2017.
[35] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[36] K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, Jan 2017.
[37] M. D. Segall. Multi-parameter optimization: Identifying high quality compounds with a balance of properties. Current Pharmaceutical Design, 18(9):1292–1310, 2012.
[38] M. H. S. Segler, T. Kogej, C. Tyrchan, and M. P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 4(1):120–131, 2018.
[39] M. Simonovsky and N. Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.
[40] D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
[41] S. A. Wildman and G. M. Crippen. Prediction of physicochemical parameters by atomic contributions. Journal of Chemical Information and Computer Sciences, 39(5):868–873, 1999.
[42] X. Yang, J. Zhang, K. Yoshizoe, K. Terayama, and K. Tsuda. ChemTS: An efficient Python library for de novo molecular generation. arXiv e-prints, Sept. 2017.
[43] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec. GraphRNN: A deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.
[44] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient.
In AAAI, 2017.