{"title": "Constrained Graph Variational Autoencoders for Molecule Design", "book": "Advances in Neural Information Processing Systems", "page_first": 7795, "page_last": 7804, "abstract": "Graphs are ubiquitous data structures for representing interactions between entities. With an emphasis on applications in chemistry, we explore the task of learning to generate graphs that conform to a distribution observed in training data. We propose a variational autoencoder model in which both encoder and decoder are graph-structured. Our decoder assumes a sequential ordering of graph extension steps and we discuss and analyze design choices that mitigate the potential downsides of this linearization. Experiments compare our approach with a wide range of baselines on the molecule generation task and show that our method is successful at matching the statistics of the original dataset on semantically important metrics. Furthermore, we show that by using appropriate shaping of the latent space, our model allows us to design molecules that are (locally) optimal in desired properties.", "full_text": "Constrained Graph Variational Autoencoders for\n\nMolecule Design\n\nQi Liu\u22171, Miltiadis Allamanis2, Marc Brockschmidt2, and Alexander L. Gaunt2\n\n1Singapore University of Technology and Design\n\n2Microsoft Research, Cambridge\n\nqiliu@u.nus.edu, {miallama, mabrocks, algaunt}@microsoft.com\n\nAbstract\n\nGraphs are ubiquitous data structures for representing interactions between entities.\nWith an emphasis on applications in chemistry, we explore the task of learning to\ngenerate graphs that conform to a distribution observed in training data. We propose\na variational autoencoder model in which both encoder and decoder are graph-\nstructured. Our decoder assumes a sequential ordering of graph extension steps\nand we discuss and analyze design choices that mitigate the potential downsides\nof this linearization. 
Experiments compare our approach with a wide range of\nbaselines on the molecule generation task and show that our method is successful\nat matching the statistics of the original dataset on semantically important metrics.\nFurthermore, we show that by using appropriate shaping of the latent space, our\nmodel allows us to design molecules that are (locally) optimal in desired properties.\n\n1\n\nIntroduction\n\nStructured objects such as program source code, physical systems, chemical molecules and even 3D\nscenes are often well represented using graphs [2, 6, 16, 25]. Recently, considerable progress has been\nmade on building discriminative deep learning models that ingest graphs as inputs [4, 9, 17, 21]. Deep\nlearning approaches have also been suggested for graph generation. More speci\ufb01cally, generating\nand optimizing chemical molecules has been identi\ufb01ed as an important real-world application for this\nset of techniques [8, 23, 24, 28, 29].\nIn this paper, we propose a novel probabilistic model for graph generation that builds gated graph\nneural networks (GGNNs) [21] into the encoder and decoder of a variational autoencoder (VAE) [15].\nFurthermore, we demonstrate how to incorporate hard domain-speci\ufb01c constraints into our model to\nadapt it for the molecule generation task. With these constraints in place, we refer to our model as a\nconstrained graph variational autoencoder (CGVAE). Additionally, we shape the latent space of the\nVAE to allow optimization of numerical properties of the resulting molecules. Our experiments are\nperformed with real-world datasets of molecules with pharmaceutical and photo-voltaic applications.\nBy generating novel molecules from these datasets, we demonstrate the bene\ufb01ts of our architectural\nchoices. 
In particular, we observe that (1) the GGNN architecture is beneficial for state-of-the-art generation of molecules matching chemically relevant statistics of the training distribution, and (2) the semantically meaningful latent space arising from the VAE allows continuous optimization of molecule properties [8].\n\n\u2217Work performed during an internship with Microsoft Research, Cambridge.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nThe key challenge in generating graphs is that sampling directly from a joint distribution over all configurations of labeled nodes and edges is intractable for reasonably sized graphs. Therefore, a generative model must decompose this joint in some way. A straightforward approximation is to ignore correlations and model the existence and label of each edge with independent random variables [5, 30, 31]. An alternative approach is to factor the distribution into a sequence of discrete decisions in a graph construction trace [22, 35]. Since correlations between edges are usually crucial in real applications, we pick the latter, sequential, approach in this work. Note that for molecule design, some correlations take the form of known hard rules governing molecule stability, and we explicitly enforce these rules wherever possible using a technique that masks out choices leading to illegal graphs [18, 28]. The remaining “soft” correlations (e.g. the disfavoring of small cycles) are learned by our graph-structured VAE.\nBy opting to generate graphs sequentially, we lose permutation symmetry and have to train using arbitrary graph linearizations. For computational reasons, we cannot consider all possible linearizations for each graph, so it is challenging to marginalize out the construction trace when computing the log-likelihood of a graph in the VAE objective. 
We design a generative model where the learned component is conditioned only on the current state of generation and not on the arbitrarily chosen path to that state. We argue that this property is intuitively desirable and show how to derive a bound for the desired log-likelihood under this model. Furthermore, this property makes the model relatively shallow, and it is easy to scale and train.\n\n2 Related Work\n\nGenerating graphs has a long history in research, and we consider three classes of related work: works that ignore correlations between edges, works that generate graphs sequentially, and works that emphasize the application to molecule design.\n\nUncorrelated generation The Erdős-Rényi G(n, p) random graph model [5] is the simplest example of this class of algorithms, where each edge exists with independent probability p. Stochastic block models [31] add community structure to the Erdős-Rényi model, but retain uncorrelated edge sampling. Other traditional random graph models, such as those of Albert and Barabási [1] and Leskovec et al. [20], do account for edge correlations, but the correlations are hand-crafted into the models. A more modern learned approach in this class is GraphVAE [30], where the decoder emits independent probabilities governing edge and node existence and labels.\n\nSequential generation Johnson [14] sidesteps the issue of permutation symmetry by considering the task of generating a graph from an auxiliary stream of information that imposes an order on construction steps. This work outlined many ingredients for the general sequential graph generation task: using GGNNs to embed the current state of generation and multi-layer perceptrons (MLPs) to drive decisions based on this embedding. Li et al. [22] use these ingredients to build an autoregressive model for graphs without the auxiliary stream. 
Their model gives good results, but each decision is conditioned on a full history of the generation sequence, and the authors remark on stability and scalability problems arising from the optimization of very deep neural networks. In addition, they describe some evidence for overfitting to the chosen linearization scheme due to the strong history dependence. Our approach also uses the ingredients from Johnson [14], but avoids the training and overfitting problems using a model that is conditioned only on the current partial graph rather than on full generation traces. In addition, we combine Johnson’s ingredients with a VAE that produces a meaningful latent space to enable continuous graph optimization [8].\nAn alternative sequential generation algorithm based on RNNs is presented in You et al. [35]. The authors point out that a dense implementation of a GGNN requires a large number O(eN²) of operations to construct a graph with e edges and N nodes. We note that this scaling problem can be mitigated using a sparse GGNN implementation [2], which reduces the complexity to O(e²).\n\nMolecule design Traditional in silico molecule design approaches rely on considerable domain knowledge, physical simulation and heuristic search algorithms (for a recent example, see Gómez-Bombarelli et al. [7]). Several deep learning approaches have also been tailored to molecule design; for example, [13] is a very promising method that uses a library of frequent (ring-containing) fragments to reduce the graph generation process to a tree generation process where nodes represent entire fragments. Alternatively, many methods rely on the SMILES linearization of molecules [33] and use RNNs to generate new SMILES strings [8, 23, 24, 29]. A particular challenge of this approach is to ensure that the generated strings are syntactically valid under the SMILES grammar. 
The Grammar VAE uses a mask to impose these constraints during generation, and a similar technique is applied for general graph construction in Samanta et al. [28]. Our model also employs masking that, among other things, ensures that the molecules we generate can be converted to syntactically valid SMILES strings.\n\nFigure 1: Illustration of the phases of the generative procedure. Nodes are initialized with latent variables and then we enter a loop between edge selection, edge labelling and node update steps until the special stop node (cid:11) is selected. We then refocus to a new node or terminate if there are no candidate focus nodes in the connected component. A looped arrow indicates that several loop iterations may happen between the illustrated steps.\n\n3 Generative Model\n\nOur generative procedure is illustrated in Fig. 1. The process is seeded with N vectors zv that together form a latent “specification” for the graph to be generated (N is an upper bound on the number of nodes in the final graph). Generation of edges between these nodes then proceeds using two decision functions: focus and expand. In each step the focus function chooses a focus node to visit, and then the expand function chooses edges to add from the focus node. As in breadth-first traversal, we implement focus as a deterministic queue (with a random choice for the initial node). Our task is thus reduced to learning the expand function that enumerates new edges connected to the currently focused node. One design choice is to make expand condition upon the full history of the generation. However, this has both theoretical and practical downsides. Theoretically, this means that the learned model is likely to learn to reproduce generation traces. This is undesirable, since the underlying data usually only contains fully formed graphs; thus the exact form of the trace is an artifact of the implemented data preprocessing. 
Practically, this would lead to extremely deep computation graphs, as even small graphs easily have many dozens of edges; this makes training of the resulting models very hard, as mentioned in Li et al. [22]. Hence, we condition expand only upon the partial graph structure G(t) generated so far; intuitively, this corresponds to learning how to complete a partial graph without using any information about how the partial graph was generated. We now present the details of each stage of this generative procedure.\n\nNode Initialization We associate a state h_v^{(t=0)} with each node v in a set of initially unconnected nodes. Specifically, zv is drawn from the d-dimensional standard normal N(0, I), and h_v^{(t=0)} is the concatenation [zv, τv], where τv is an interpretable one-hot vector indicating the node type. τv is derived from zv by sampling from the softmax output of a learned mapping τv ∼ f(zv), where f is a neural network2. The interpretable component of h_v^{(t=0)} gives us a means to enforce hard constraints during generation.\nFrom these node-level variables, we can calculate global representations H(t) (the average representation of nodes in the connected component at generation step t), and Hinit (the average representation of all nodes at t = 0). 
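As a concrete illustration, the node initialization step can be sketched in a few lines of numpy. This is a minimal sketch, not the paper's implementation: the latent dimension, the number of node types, and the random (untrained) weight matrix standing in for the learned classifier f are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_nodes(n_nodes, d=8, n_types=4):
    # z_v ~ N(0, I): the latent "specification" of each node
    z = rng.normal(size=(n_nodes, d))
    # Stand-in for the learned mapping f: here an untrained linear classifier
    W = rng.normal(size=(d, n_types))
    logits = z @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)      # softmax over node types
    types = np.array([rng.choice(n_types, p=p) for p in probs])
    tau = np.eye(n_types)[types]                   # interpretable one-hot tau_v
    h0 = np.concatenate([z, tau], axis=1)          # h_v^(0) = [z_v, tau_v]
    return z, tau, h0

z, tau, h0 = init_nodes(5)
```

The one-hot component of h0 is what makes hard constraints (such as the valency masks of Sect. 5.2) enforceable during generation.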
In addition to N working nodes, we also initialize a special “stop node” to a learned representation h(cid:11) for managing algorithm termination (see below).\n\n2We implement f as a linear classifier from the 100-dimensional latent space to one of the node type classes.\n\nNode Update Whenever we obtain a new graph G(t+1), we discard h_v^{(t)} and compute new representations h_v^{(t+1)} for all nodes, taking their (possibly changed) neighborhood into account. This is implemented using a standard gated graph neural network (GGNN) Gdec for S steps3, which is defined as a recurrent operation over messages m_v^{(s)}:\n\nm_v^{(0)} = h_v^{(0)},    m_v^{(s+1)} = GRU( m_v^{(s)}, Σ_{v ℓ↔ u} E_ℓ(m_u^{(s)}) ),    h_v^{(t+1)} = m_v^{(S)}\n\nHere the sum runs over all edges in the current graph and E_ℓ is an edge-type specific neural network4. We also augment our model with a master node as described by Gilmer et al. [6]. Note that since h_v^{(t+1)} is computed from h_v^{(0)} rather than h_v^{(t)}, the representation h_v^{(t+1)} is independent of the generation history of G(t+1).\n\nEdge Selection and Labelling We first pick a focus node v from our queue. 
The function expand then selects edges v ℓ↔ u from v to u with label ℓ as follows. For each non-focus node u, we construct a feature vector φ_{v,u}^{(t)} = [h_v^{(t)}, h_u^{(t)}, d_{v,u}, Hinit, H(t)], where d_{v,u} is the graph distance between v and u. This provides the model with both local information for the focus node v and the candidate edge (h_v^{(t)}, h_u^{(t)}), and global information regarding the original graph specification (Hinit) and the current graph state (H(t)). We use these representations to produce a distribution over candidate edges:\n\np(v ℓ↔ u | φ_{v,u}^{(t)}) = p(ℓ | φ_{v,u}^{(t)}, v ↔ u) · p(v ↔ u | φ_{v,u}^{(t)}).\n\nThe factors are calculated as softmax outputs from neural networks C (determining the target node for an edge) and L_ℓ (determining the type of the edge):5\n\np(v ↔ u | φ_{v,u}^{(t)}) = M_{v↔u}^{(t)} exp[C(φ_{v,u}^{(t)})] / Σ_w M_{v↔w}^{(t)} exp[C(φ_{v,w}^{(t)})],    p(ℓ | φ_{v,u}^{(t)}) = m_{v ℓ↔u}^{(t)} exp[L_ℓ(φ_{v,u}^{(t)})] / Σ_k m_{v k↔u}^{(t)} exp[L_k(φ_{v,u}^{(t)})].    (1)\n\nM_{v↔u}^{(t)} and m_{v ℓ↔u}^{(t)} are binary masks that forbid edges that violate constraints. We discuss the construction of these masks for the molecule generation case in Sect. 5.2. New edges are sampled from these distributions, and any nodes that are connected to the graph for the first time are added to the focus queue. Note that we only consider undirected edges in this paper, but it is easy to extend the model to directed graphs.\n\nTermination We keep adding edges to a node v using expand and Gdec until an edge to the stop node is selected. Node v then loses focus and becomes “closed” (the mask M ensures that no further edges will ever be made to v). The next focus node is selected from the focus queue. 
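The masked softmax used in Eq. (1) can be sketched as follows. This is an illustrative fragment only: the logits stand in for the outputs of the scoring networks C or L_ℓ, which are not modeled here.

```python
import numpy as np

def masked_softmax(logits, mask):
    """Masked softmax as in Eq. (1): candidates with mask 0 (e.g. closed
    nodes, duplicate edges, self loops) receive exactly zero probability."""
    logits = np.asarray(logits, dtype=float)
    mask = np.asarray(mask, dtype=float)
    weights = mask * np.exp(logits - logits.max())  # numerically stable
    return weights / weights.sum()

# Three candidate target nodes for the focus node; the second is masked out.
p = masked_softmax([1.0, 2.0, 0.5], [1, 0, 1])
```

Because the mask multiplies the unnormalized weights, forbidden choices are impossible by construction rather than merely discouraged, which is what guarantees hard constraints such as valency rules.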
In this way, a single connected component is grown in a breadth-first manner. Edge generation continues until the queue is empty (note that this may leave some unconnected nodes, which will be discarded).\n\n4 Training the Generative Model\n\nThe model from Sect. 3 relies on a latent space with semantically meaningful points concentrated in the region weighted under the standard normal, and trained networks f, C, L_ℓ and Gdec. We train these in a VAE architecture on a large dataset D of graphs. Details of this VAE are provided below.\n\n4.1 Encoder\n\nThe encoder of our VAE is a GGNN Genc that embeds each node in an input graph G to a diagonal normal distribution in d-dimensional latent space parametrized by mean μv and standard deviation σv vectors. The latent vectors zv are sampled from these distributions, and we construct the usual VAE regularizer term measuring the KL divergence between the encoder distribution and the standard Gaussian prior: Llatent = Σ_{v∈G} KL( N(μv, diag(σv)²) || N(0, I) ).\n\n3Our experiments use S = 7.\n4In our implementation, E_ℓ is a dimension-preserving linear transformation.\n5C and L_ℓ are fully connected networks with a single hidden layer of 200 units and ReLU non-linearities.\n\n4.2 Decoder\n\nThe decoder is the generative procedure described in Sect. 3, and we condition generation on a latent sample from the encoder distribution during training. We supervise training of the overall model using generation traces extracted from graphs in D.\n\nNode Initialization To obtain initial node states h_v^{(t=0)}, we first sample a node specification zv for each node v, and then independently for each node we generate the label τv using the learned function f. 
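The per-node KL regularizer above has the standard closed form for a diagonal Gaussian against N(0, I); a minimal numerical sketch (the four-dimensional latent space is illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma)^2) || N(0, I) ), summed over the d latent
    dimensions of a single node: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - 2.0 * log_sigma)

# The KL vanishes exactly when the encoder distribution matches the prior.
kl_zero = kl_to_standard_normal(np.zeros(4), np.zeros(4))
```

Summing this quantity over all nodes v in the input graph gives the Llatent term of the training objective.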
The probability of re-generating the labels τ*_v observed in the encoded graph is given by a sum over node permutations P:\n\np(G(0) | z) = Σ_P p(τ = P(τ*) | z) > Π_v p(τv = τ*_v | zv).\n\nThis inequality provides a lower bound given by the single contribution from the ordering used in the encoder (recall that in the encoder we know the node type τ*_v from which zv was generated). A set2set model [32] could improve this bound.\n\nEdge Selection and Labelling During training, we provide supervision on the sequence of edge additions based on breadth-first traversals of each graph in the dataset D. Formally, to learn a distribution over graphs (and not graph generation traces), we would need to train with an objective that computes the log-likelihood of each graph by marginalizing over all possible breadth-first traces. This is computationally intractable, so in practice we only compute a Monte-Carlo estimate of the marginal on a small set of sampled traces. However, recall from Sect. 3 that our expand model is not conditioned on full traces, and instead only considers the partial graph generated so far. 
Below we outline how this intuitive design formally affects the VAE training objective.\nGiven the initial collection of unconnected nodes, G(0), from the initialization above, we first use Jensen’s inequality to show that the log-likelihood of a graph G is loosely lower bounded by the expected log-likelihood of all the traces Π that generate it:\n\nlog p(G | G(0)) = log Σ_{π∈Π} p(π | G(0)) ≥ log(|Π|) + (1/|Π|) Σ_{π∈Π} log p(π | G(0))    (2)\n\nWe can decompose each full generation trace π ∈ Π into a sequence of steps of the form (t, v, ε), where v is the current focus node and ε = v ℓ↔ u is the edge added at step t:\n\nΣ_{π∈Π} log p(π | G(0)) = Σ_{π∈Π} Σ_{(t,v,ε)∈π} { log p(v | π, t) + log p(ε | G(t−1), v) }\n\nThe first term corresponds to the choice of v as focus node at step t of trace π. As our focus function is fixed, this choice is uniform in the first focus node and then deterministically follows a breadth-first queuing system. A summation over this term thus evaluates to the constant log(1/N).\nAs discussed above, the second term is only conditioned on the current graph (and not the whole generation history G(0) . . . G(t−1)). To evaluate it further, we consider the set of generation states S of all valid state pairs s = (G(t), v) of a partial graph G(t) and a focus node v. We use |s| to denote the multiplicity of state s in Π, i.e., the number of traces that contain graph G(t) and focus on node v. Let Es denote all edges that could be generated at state s, i.e., the edges from the focus node v that are present in the graph G from the dataset, but are not yet present in G(t). 
Then, each of these appears uniformly as the next edge to generate in a trace, for all |s| occurrences of s in a trace from Π, and therefore we can rearrange a sum over paths into a sum over steps:\n\n(1/|Π|) Σ_{π∈Π} Σ_{(t,v,ε)∈π} log p(ε | s) = (1/|Π|) Σ_{s∈S} (|s|/|Es|) Σ_{ε∈Es} log p(ε | s) = E_{s∼Π}[ (1/|Es|) Σ_{ε∈Es} log p(ε | s) ]\n\nFigure 2: Steps considered in our model.\n\nHere we use that |s|/|Π| is the probability of observing state s in a random draw from all states in Π. We use this expression in Eq. 2 and train our VAE with a reconstruction loss Lrecon. = Σ_{G∈D} log[ p(G | G(0)) · p(G(0) | z) ], ignoring additive constants.\n\nWe evaluate the expectation over states s using a Monte Carlo estimate from a set of enumerated generation traces. In practice, this set of paths is very small (e.g. a single trace), resulting in a high-variance estimate. Intuitively, Fig. 2 shows that rather than requiring the model to exactly reproduce each step of the sampled paths (orange), our objective does not penalize the model for choosing any valid expansion at each step (black).\n\n4.3 Optimizing Graph Properties\n\nSo far, we have described a generative model for graphs. In addition, we may wish to perform (local) optimization of these graphs with respect to some numerical property, Q. This is achieved by gradient ascent in the continuous latent space using a differentiable gated regression model\n\nR(zv) = Σ_v σ(g1(zv)) · g2(zv),\n\nwhere g1 and g2 are neural networks6 and σ is the sigmoid function. Note that the combination of R with Genc (i.e., R(Genc(G))) is exactly the GGNN regression model from Gilmer et al. 
[6]. During training, we use an L2 distance loss LQ between R(zv) and the labeled properties Q. This regression objective shapes the latent space, allowing us to optimize for the property Q in it. Thus, at test time, we can sample an initial latent point zv and then use gradient ascent to a locally optimal point z*_v, subject to an L2 penalty that keeps z*_v within the standard normal prior of the VAE. Decoding from the point z*_v then produces graphs with an optimized property Q. We show this in our experiments in Sect. 6.2.\n\n4.4 Training objective\n\nThe overall objective is L = Lrecon. + λ1 Llatent + λ2 LQ, consisting of the usual VAE objective (reconstruction terms and regularization on the latent variables) and the regression loss. Note that we allow deviation from the pure VAE loss (λ1 = 1) following Yeung et al. [34].\n\n5 Application: Molecule Generation\n\nIn this section, we describe additional specialization of our model for the application of generating chemical molecules. Specifically, we outline details of the molecular datasets that we use and the domain-specific masking factors that appear in Eq. 1.\n\n5.1 Datasets\n\nWe consider three datasets commonly used in the evaluation of computational chemistry approaches:\n• QM9 [26, 27], an enumeration of ∼134k stable organic molecules with up to 9 heavy atoms (carbon, oxygen, nitrogen and fluorine). As no filtering is applied, the molecules in this dataset only reflect basic structural constraints.\n• ZINC dataset [12], a curated set of 250k commercially available drug-like chemical compounds. On average, these molecules are bigger (∼23 heavy atoms) and structurally more complex than the molecules in QM9.\n• CEPDB [10, 11], a dataset of organic molecules with an emphasis on photo-voltaic applications. The contained molecules have ∼28 heavy atoms on average and contain six to seven rings each. 
We use a subset of the full database containing 250k randomly sampled molecules.\n\nFor all datasets we kekulize the molecules so that the only edge types to consider are single, double and triple covalent bonds, and we remove all hydrogen atoms. In the encoder, molecular graphs are presented with nodes annotated with one-hot vectors τ*_v indicating their atom type and charge.\n\n6In our experiments, both g1 and g2 are implemented as linear transformations that project to scalars.\n\nFigure 3: Overview of statistics of sampled molecules from a range of generative models trained on different datasets. In (b) we highlight the target statistics of the datasets in yellow and use the numbers 2, ..., 7 to denote different models as shown in the axis key. A hatched box indicates where other works do not supply benchmark results. Two samples from our model on each dataset are shown in (c), with more random samples given in supplementary material A. Panel (a) as a table:\n\nMeasure | 2: CGVAE | 3: [22] | 4: LSTM | 5: [8] | 6: [18] | 7: [30] | 8: [28]\nQM9 % valid | 100 | - | 94.78 | 10.00 | 30.00 | 61.00 | 98.00\nQM9 % novel | 94.35 | - | 82.98 | 90.00 | 95.44 | 85.00 | 100\nQM9 % unique | 98.57 | - | 96.94 | 67.50 | 9.30 | 40.90 | 99.86\nZINC % valid | 100 | 89.20 | 96.80 | 17.00 | 31.00 | 14.00 | -\nZINC % novel | 100 | 89.10 | 100 | 98.00 | 100 | 100 | -\nZINC % unique | 99.82 | 99.41 | 99.97 | 30.98 | 10.76 | 31.60 | -\nCEPDB % valid | 100 | - | 99.61 | 8.30 | 0.00 | - | -\nCEPDB % novel | 100 | - | 92.43 | 90.05 | - | - | -\nCEPDB % unique | 99.62 | - | 99.56 | 80.99 | - | - | -\n\n5.2 Valency masking\n\nValency rules impose a strong constraint on constructing syntactically valid molecules7. The valency of an atom indicates the number of bonds that that atom can make in a stable molecule, where edge types “double” and “triple” count for 2 and 3 bonds respectively. 
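The valency bookkeeping used by the masking of Sect. 5.2 can be sketched as follows. This is an illustrative sketch only; the valency table covers just the neutral kekulized heavy-atom types mentioned in the text, not the full set of atom/charge node types the model uses.

```python
BOND_ORDER = {"single": 1, "double": 2, "triple": 3}
VALENCY = {"C": 4, "N": 3, "O": 2, "F": 1}  # neutral heavy atoms, kekulized

def allowed_bond_types(atom, bonds_used):
    """An edge label is only allowed if the atom's remaining bond capacity
    (valency minus bond orders already used) can accommodate its order."""
    remaining = VALENCY[atom] - bonds_used
    return [name for name, order in BOND_ORDER.items() if order <= remaining]

def hydrogens_to_add(atom, bonds_used):
    """Capacity left unused at the end of generation is filled with hydrogens."""
    return VALENCY[atom] - bonds_used
```

For example, an oxygen atom that has already made one single bond can only accept one more single bond, while a fresh carbon atom can still accept any bond type.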
In our data, each node type has a fixed valency given by known chemical properties; for example, node type “O” (an oxygen atom) has a valency of 2 and node type “O−” (an oxygen ion) has a valency of 1. Throughout the generation process, we use masks M and m to guarantee that the number of bonds b_v at each node never exceeds the valency b*_v of the node. If b_v < b*_v at the end of generation, we link b*_v − b_v hydrogen atoms to node v. In this way, our generation process always produces syntactically valid molecules (we define syntactic validity as the ability to parse the graph to a SMILES string using the RDKit parser [19]). More specifically, M_{v↔u}^{(t)} also handles avoidance of edge duplication and self loops, and is defined as:\n\nM_{v↔u}^{(t)} = 1(b_v < b*_v) × 1(b_u < b*_u) × 1(no v ↔ u exists) × 1(v ≠ u) × 1(u is not closed),    (3)\n\nwhere 1 is an indicator function; as a special case, connections to the stop node are always unmasked. Further, when selecting the label for a chosen edge, we must again avoid violating the valency constraint, so we define m_{v ℓ↔u}^{(t)} = M_{v↔u}^{(t)} × 1(b*_u − b_u ≥ ℓ), using ℓ = 1, 2, 3 to indicate single, double and triple bond types respectively.\n\n6 Experiments\n\nWe evaluate baseline models, our model (CGVAE) and a number of ablations on the two tasks of molecule generation and optimization8.\n\n6.1 Generating molecules\n\nAs baseline models, we consider the deep autoregressive graph model (which we refer to as DeepGAR) from [22], a SMILES-generating LSTM language model with 256 hidden units (reduced to 64 units\n\n7Note that more complex domain knowledge, e.g. 
Bredt’s rule [3], could also be handled in our model, but we do not implement this here.\n8Our implementation of CGVAE can be found at https://github.com/Microsoft/constrained-graph-variational-autoencoder.\n\nfor the smaller QM9 dataset), ChemVAE [8], GrammarVAE [18], GraphVAE [30], and the graph model from [28]. We train these on our three datasets and then sample 20k molecules from the trained models (in the case of [22, 28], we obtained sets of sampled molecules from the authors).\nWe analyze the methods using two sets of metrics. First, in Fig. 3(a) we show metrics from existing work: syntactic validity, novelty (i.e. the fraction of sampled molecules not appearing in the training data) and uniqueness (i.e. the ratio of sample set size before and after deduplication of identical molecules). Second, in Fig. 3(b) we introduce new metrics to measure how well each model captures the distribution of molecules in the training set. Specifically, we measure the average number of each atom type and each bond type in the sampled molecules, and we count the average number of 3-, 4-, 5-, and 6-membered cycles in each molecule. This latter metric is chemically relevant because 3- and 4-membered rings are typically unstable due to their high ring strain. Fig. 3(c) shows 2 samples from our model for each dataset, and we show more samples of generated molecules in the supplementary material.\nThe results in Fig. 3 show that CGVAE is excellent at matching graph statistics, while generating valid, novel and unique molecules for all datasets considered (additional details are found in supplementary material B and C). The only competitive baselines are DeepGAR from Li et al. [22] and an LSTM language model. 
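The novelty and uniqueness metrics reported in Fig. 3(a) are straightforward to compute; a small sketch over canonicalized molecule strings (e.g. canonical SMILES), with toy data in place of real samples:

```python
def novelty_and_uniqueness(samples, training_set):
    """Novelty: fraction of sampled molecules not appearing in the training
    data. Uniqueness: ratio of sample set size after vs. before
    deduplication of identical molecules."""
    training = set(training_set)
    novelty = sum(s not in training for s in samples) / len(samples)
    uniqueness = len(set(samples)) / len(samples)
    return novelty, uniqueness

# Toy example: four samples, one duplicated, one memorized from training.
nov, uniq = novelty_and_uniqueness(["CCO", "CCO", "CCN", "c1ccccc1"],
                                   {"c1ccccc1"})
```

In practice both metrics only make sense over a canonical representation, so that two different linearizations of the same molecule count as duplicates.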
Our approach has three advantages over these baselines: First, whereas >10% of ZINC-like molecules generated by DeepGAR are invalid, our masking mechanism guarantees molecule validity. An LSTM is surprisingly effective at generating valid molecules; however, LSTMs do not permit the injection of domain knowledge (e.g. valence rules or the requirement for the existence of a particular scaffold) because meaningful constraints cannot be imposed on the flat SMILES representation during generation. Second, we train a shallow model on breadth-first steps rather than full paths and therefore do not experience the problems with training instability or overfitting that are described in Li et al. [22]. Empirical indication for overfitting in DeepGAR is seen by the fact that Li et al. [22] achieves the lowest novelty score on the ZINC dataset, suggesting that it more often replays memorized construction traces. It is also observed in the LSTM case, where on average 60% of each generated SMILES string is copied from the nearest neighbour in the training set. Converting our generated graphs to SMILES strings reveals only 40% similarity to the nearest neighbour under the same metric. Third, we are able to use our continuous latent space for molecule optimization (see below).

We also perform an ablation study on our method. For brevity we only report results using our ring count metrics; other statistics show similar behavior. From all our experiments we highlight three aspects that are important choices to obtain good results, and we report these in ablation experiments A, B and C in Fig. 4. In experiment A we remove the distance feature d_{v,u} from φ and see that this harms performance on the larger molecules in the ZINC dataset. More interestingly, we see poor results in experiment B, where we make an independence assumption on edge generation (i.e. use features φ to calculate independent probabilities for all possible edges and sample an entire molecule in one step). We also see poor results in experiment C, where we remove the GGNN from the decoder (i.e. perform sequential construction with h_v^{(t)} = h_v^{(0)}). This indicates that the choice to perform sequential decoding with GGNN node updates before each decision is key to the success of our model.

[Figure 4: Ablation study using the ring metric on ZINC and QM9. 1 indicates statistics of the datasets, 2 of our model, and A, B, C of the ablations discussed in the text.]

6.2 Directed molecule generation

Finally, we show that we can use the VAE structure of our method to direct the molecule generation towards especially interesting molecules. As discussed in Sect. 4.3 (and first shown by Gómez-Bombarelli et al. [8] in this setting), we extend our architecture to predict the Quantitative Estimate of Drug-Likeness (QED) directly from latent space. This allows us to generate molecules with very high QED values by performing gradient ascent in the latent space using the trained QED-scoring network. Fig. 5 shows an interpolation sequence from a point in latent space with a low QED value (which ranges between 0 and 1) to the local maximum. For each point in the sequence, the figure shows a generated molecule, the QED value our architecture predicts for this molecule, and the QED value computed by RDKit.

[Figure 5: Trajectory of QED-directed optimization in latent space. Predicted vs. real (RDKit) QED along the trajectory: 0.5686/0.5345, 0.6685/0.6584, 0.7539/0.7423, 0.8376/0.8298, 0.9013/0.8936, 0.9271/0.9383. Additional examples are shown in supplementary material D.]

7 Conclusion

We proposed CGVAE, a sequential generative model for graphs built from a VAE with GGNNs in the encoder and decoder.
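To recap the directed generation of Sect. 6.2 concretely: given any differentiable scorer over the latent space, molecules are improved by repeated gradient steps on the latent code. The sketch below substitutes a toy quadratic surrogate with a known maximum for the trained QED-scoring network (which is not reproduced here), so only the gradient-ascent loop itself is illustrated; `surrogate_qed` and `z_opt` are hypothetical stand-ins.

```python
# Sketch of latent-space property optimization in the style of Sect. 6.2.
# The paper performs gradient ascent on a trained QED predictor; here a toy
# differentiable surrogate with maximum at z_opt stands in for that network.

def surrogate_qed(z, z_opt):
    """Toy stand-in for the learned QED predictor (peaks at z_opt)."""
    return 1.0 - sum((zi - oi) ** 2 for zi, oi in zip(z, z_opt))

def surrogate_grad(z, z_opt):
    """Analytic gradient of the surrogate with respect to z."""
    return [-2.0 * (zi - oi) for zi, oi in zip(z, z_opt)]

def optimize_latent(z0, z_opt, lr=0.1, steps=100):
    """Gradient ascent in latent space towards higher predicted score."""
    z = list(z0)
    for _ in range(steps):
        z = [zi + lr * gi for zi, gi in zip(z, surrogate_grad(z, z_opt))]
    return z

z_start = [0.5, -1.0, 2.0]     # low-scoring starting point in latent space
z_local_max = [0.0, 0.0, 0.0]  # local maximum of the surrogate
z_final = optimize_latent(z_start, z_local_max)
# z_final converges towards z_local_max; in the full model, each point along
# the trajectory is decoded into a molecule, as in the sequence of Fig. 5.
```

In the full model the ascent follows the learned, generally non-convex scoring network, so the procedure finds a local rather than global optimum, which matches the "(locally) optimal" claim in the abstract.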
Using masks that enforce chemical rules, we specialized our model to the application of molecule generation and achieved state-of-the-art generation and optimization results. We introduced basic statistics to validate the quality of our generated molecules. Future work will need to link to the chemistry community to define additional metrics that further guide the construction of models and datasets for real-world molecule design tasks.

References

[1] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.

[2] M. Allamanis, M. Brockschmidt, and M. Khademi. Learning to represent programs with graphs. In ICLR, 2018.

[3] J. Bredt, J. Houben, and P. Levy. Ueber isomere Dehydrocamphersäuren, Lauronolsäuren und Bihydrolauro-lactone. Berichte der deutschen chemischen Gesellschaft, 35(2):1286–1292, 1902.

[4] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.

[5] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.

[6] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

[7] R. Gómez-Bombarelli, J. Aguilera-Iparraguirre, T. D. Hirzel, D. Duvenaud, D. Maclaurin, M. A. Blood-Forsythe, H. S. Chae, M. Einzinger, D.-G. Ha, T. Wu, et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nature Materials, 15(10):1120, 2016.

[8] R. Gómez-Bombarelli, D. K. Duvenaud, J. M. Hernández-Lobato, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

[9] M.
Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IJCNN, 2005.

[10] J. Hachmann, C. Román-Salgado, K. Trepte, A. Gold-Parker, M. Blood-Forsythe, L. Seress, R. Olivares-Amaya, and A. Aspuru-Guzik. The Harvard Clean Energy Project database. http://cepdb.molecularspace.org.

[11] J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R. S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A. M. Brockway, and A. Aspuru-Guzik. The Harvard Clean Energy Project: large-scale computational screening and design of organic photovoltaics on the World Community Grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.

[12] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. ZINC: a free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.

[13] W. Jin, R. Barzilay, and T. Jaakkola. Junction tree variational autoencoder for molecular graph generation. In ICML, 2018.

[14] D. D. Johnson. Learning graphical state transitions. In ICLR, 2017.

[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[16] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel. Neural relational inference for interacting systems. In ICML, 2018.

[17] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[18] M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. CoRR, abs/1703.01925, 2017.

[19] G. Landrum. RDKit: open-source cheminformatics. http://www.rdkit.org, 2014.

[20] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani.
Kronecker graphs: an approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985–1042, 2010.

[21] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In ICLR, 2016.

[22] Y. Li, O. Vinyals, C. Dyer, R. Pascanu, and P. Battaglia. Learning deep generative models of graphs. CoRR, abs/1803.03324, 2018.

[23] D. Neil, M. Segler, L. Guasch, M. Ahmed, D. Plumbley, M. Sellwood, and N. Brown. Exploring deep recurrent models with reinforcement learning for molecule design. In ICLR [Workshop Track], 2018.

[24] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics, 9(1):48, 2017.

[25] X. Qi, R. Liao, J. Jia, S. Fidler, and R. Urtasun. 3D graph neural networks for RGBD semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5199–5208, 2017.

[26] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1:140022, 2014.

[27] L. Ruddigkeit, R. van Deursen, L. C. Blum, and J.-L. Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012.

[28] B. Samanta, A. De, N. Ganguly, and M. Gomez-Rodriguez. Designing random graph models using variational autoencoders with applications to chemical design. CoRR, abs/1802.05283, 2018.

[29] M. H. Segler, T. Kogej, C. Tyrchan, and M. P. Waller. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Central Science, 2017.

[30] M. Simonovsky and N. Komodakis. Towards variational generation of small graphs. In ICLR [Workshop Track], 2018.

[31] T. A. Snijders and K. Nowicki.
Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.

[32] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: sequence to sequence for sets. In ICLR, 2016.

[33] D. Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.

[34] S. Yeung, A. Kannan, Y. Dauphin, and L. Fei-Fei. Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643, 2017.

[35] J. You, R. Ying, X. Ren, W. L. Hamilton, and J. Leskovec. GraphRNN: a deep generative model for graphs. arXiv preprint arXiv:1802.08773, 2018.