{"title": "Retrosynthesis Prediction with Conditional Graph Logic Network", "book": "Advances in Neural Information Processing Systems", "page_first": 8872, "page_last": 8882, "abstract": "Retrosynthesis is one of the fundamental problems in organic chemistry. The task is to identify reactants that can be used to synthesize a specified product molecule. Recently, computer-aided retrosynthesis is finding renewed interest from both chemistry and computer science communities. Most existing approaches rely on template-based models that define subgraph matching rules, but whether or not a chemical reaction can proceed is not defined by hard decision rules. In this work, we propose a new approach to this task using the Conditional Graph Logic Network, a conditional graphical model built upon graph neural networks that learns when rules from reaction templates should be applied, implicitly considering whether the resulting reaction would be both chemically feasible and strategic. We also propose an efficient hierarchical sampling to alleviate the computation cost. While achieving a significant improvement of 8.1% over current state-of-the-art methods on the benchmark dataset, our model also offers interpretations for the prediction.", "full_text": "Retrosynthesis Prediction with Conditional Graph Logic Network

Hanjun Dai‡†*, Chengtao Li², Connor W. Coley◇, Bo Dai‡, Le Song†

‡Google Research, Brain Team, {hadai, bodai}@google.com
²Galixir Inc., chengtao.li@galixir.com
◇Massachusetts Institute of Technology, ccoley@mit.edu
†Georgia Institute of Technology, Ant Financial, lsong@cc.gatech.edu

Abstract

Retrosynthesis is one of the fundamental problems in organic chemistry. The task is to identify reactants that can be used to synthesize a specified product molecule. Recently, computer-aided retrosynthesis is finding renewed interest from both chemistry and computer science communities. 
Most existing approaches rely on template-based models that define subgraph matching rules, but whether or not a chemical reaction can proceed is not defined by hard decision rules. In this work, we propose a new approach to this task using the Conditional Graph Logic Network, a conditional graphical model built upon graph neural networks that learns when rules from reaction templates should be applied, implicitly considering whether the resulting reaction would be both chemically feasible and strategic. We also propose an efficient hierarchical sampling to alleviate the computation cost. While achieving a significant improvement of 8.1% over current state-of-the-art methods on the benchmark dataset, our model also offers interpretations for the prediction.

1 Introduction

Retrosynthesis planning is the procedure of identifying a series of reactions that lead to the synthesis of a target product. It was first formalized by E. J. Corey [1] and has since become one of the fundamental problems in organic chemistry. This problem of "working backwards from the target" is challenging due to the size of the search space (the vast number of theoretically possible transformations) and thus requires skill and creativity from experienced chemists. Recently, various computer algorithms [2] have begun to assist experienced chemists and save them tremendous time and effort.

The simplest formulation of retrosynthesis takes the target product as input and predicts possible reactants 1. It is essentially the "reverse problem" of reaction prediction. In reaction prediction, the reactants (and sometimes reagents as well) are given as the input and the desired outputs are possible products. In this case, the atoms of the desired products are a subset of the reactant atoms, since the side products are often ignored (see Fig 1). 
Thus models are essentially designed to identify this subset of the reactant atoms and reassemble them into the product. This can be treated as a deductive reasoning process. In sharp contrast, retrosynthesis has to identify a superset of the atoms in the target product, and thus is an abductive reasoning process that requires "creativity" to be solved, making it a harder problem. Although recent advances in graph neural networks have led to superior performance in reaction prediction [3, 4, 5], such advances do not transfer to retrosynthesis.

Computer-aided retrosynthesis tools have been deployed over the years since [6]. Some of them are completely rule-based systems [7] and do not scale well due to high computation cost and

*Work done while Hanjun was at Georgia Institute of Technology
1We will focus on this "single step" version of retrosynthesis in our paper.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

[Figure 1 here; the molecular drawings did not survive extraction. It shows two example reactions with their reaction centers highlighted (atom-mapped centers such as C:1-C:5, N:4, N:5, O:2), alongside the corresponding retrosynthesis templates "heteroatom alkylation and arylation" and "acylation and related processes".]

Figure 1: Chemical reactions and the retrosynthesis templates. The reaction centers are highlighted in each participant of the reaction. 
These centers are then extracted to form the corresponding template. Note that the atoms belonging to the reaction side products (the dashed box in the figure) are missing.

incomplete coverage of the rules, especially when rules are expert-defined rather than algorithmically extracted [2]. Despite these limitations, such rules are very useful for encoding chemical transformations and are easy to interpret. Building on this, retrosim [8] uses molecule and reaction fingerprint similarities to select the rules to apply for retrosynthesis. Other approaches have used neural classification models for this selection task [9]. On the other hand, there have recently been attempts to use sequence-to-sequence models to directly predict the SMILES 2 representation of reactants [10, 11] (and, for the forward prediction problem, products [12, 13]). Albeit simple and expressive, these approaches ignore the rich chemistry knowledge and thus require a huge amount of training data. Such models also lack interpretable reasoning behind their predictions.

The current landscape of computer-aided synthesis planning motivated us to pursue an algorithm that shares the interpretability of template-based methods while taking advantage of the scalability and expressiveness of neural networks to learn when such rules apply. In this paper, we propose the Conditional Graph Logic Network towards this direction, where chemistry knowledge about reaction templates is treated as logic rules and a conditional graphical model is introduced to tolerate the noise in these rules. In this model, the variables are molecules, while the synthetic relationships to be inferred are defined among groups of molecules. Furthermore, to handle the potentially infinite number of possible molecule entities, we exploit neural graph embeddings in this model.

Our contribution can be summarized as follows:
1) We propose a new graphical model for the challenging retrosynthesis task. 
Our model brings both the capacity of neural embeddings and the interpretability that comes from a tight integration of probabilistic models and chemical rules.

2) We propose an efficient hierarchical sampling method for approximate learning by exploiting the structure of the rules. Such an algorithm not only makes training feasible, but also provides interpretations for the predictions.

3) Experiments on the benchmark datasets show a significant 8.1% improvement over existing state-of-the-art methods in top-one accuracy.

Other related work: Recently there has been work using machine learning to enhance rule-based systems. Most of it treats rule selection as multi-class classification [9] or hierarchical classification [14], where similar rules are grouped into subcategories. One potential issue is that the model size grows with the number of rules. Our work directly models the conditional joint probability of both the rules and the reactants using embeddings, so the model size is invariant to the number of rules. On the other hand, researchers have also tried to tackle the even harder problem of multi-step retrosynthesis [15, 16] using single-step retrosynthesis as a subroutine, so our improvements in single-step retrosynthesis could transfer directly into improvements in multi-step retrosynthesis [8].

2 Background

A chemical reaction can be seen as a transformation from a set of $N$ reactant molecules $\{R_i\}_{i=1}^{N}$ to an outcome molecule $O$. Without loss of generality, we work with single-outcome reactions in this paper, as this is a standard formulation of the retrosynthetic problem and multi-outcome reactions can be split into multiple single-outcome ones. We refer to the set of atoms changed (e.g., bonds being added or deleted) during the reaction as reaction centers. 
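To make these objects concrete before the formal definitions below, here is a minimal illustrative sketch of how a single-outcome reaction and an extracted template could be stored. All names and the pattern strings are our own illustration (the paper itself encodes templates as SMARTS/SMIRKS via rdchiral), not the released code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Template:
    """A retrosynthesis template o_T -> r_T^1 + ... + r_T^{N(T)}.
    Patterns are stored as SMARTS-like strings (illustrative only)."""
    outcome_pattern: str          # o_T: the reaction-center pattern in the product
    reactant_patterns: List[str]  # r_T^i: one subgraph pattern per reactant

    @property
    def n_reactants(self) -> int:
        """N(T): the number of reactant subgraphs in the template."""
        return len(self.reactant_patterns)

@dataclass
class Reaction:
    """A single-outcome reaction {R_i} -> O, molecules as SMILES strings."""
    reactants: List[str]
    outcome: str

# toy esterification-like example (illustrative strings only)
t = Template(outcome_pattern="C(=O)OC", reactant_patterns=["C(=O)O", "OC"])
assert t.n_reactants == 2
```

A multi-outcome reaction would simply be split into several `Reaction` records, one per product, matching the convention stated above.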
Given a reaction, the corresponding retrosynthesis template $T$ is represented by a subgraph pattern rewriting rule 3

$$T := o_T \rightarrow r_T^1 + r_T^2 + \ldots + r_T^{N(T)}, \qquad (1)$$

where $N(\cdot)$ represents the number of reactant subgraphs in the template, as illustrated in Figure 1. Generally we can treat the subgraph pattern $o_T$ as the extracted reaction center from $O$, and $r_T^i$, $i \in \{1, 2, \ldots, N(T)\}$, as the corresponding pattern inside the $i$-th reactant, though in practice these patterns also include neighboring structures of the reaction centers.

2https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html.

We first introduce the notation used to represent these chemical entities:
• Subgraph patterns: we use lower case letters to represent subgraph patterns.
• Molecules: we use capital letters to represent molecule graphs. By default, we use $O$ for an outcome molecule, $R$ for a reactant molecule, and $M$ for any molecule in general.
• Sets: sets are represented by calligraphic letters. We use $\mathcal{M}$ to denote the full set of possible molecules, $\mathcal{T}$ to denote all extracted retrosynthetic templates, and $\mathcal{F}$ to denote all subgraph patterns that are involved in the known templates. We further use $\mathcal{F}_o$ to denote the subgraphs appearing in reaction outcomes, and $\mathcal{F}_r$ to denote those appearing in reactants, with $\mathcal{F} = \mathcal{F}_o \cup \mathcal{F}_r$.

Task: Given a product or target molecule $O$, the goal of a one-step retrosynthetic analysis is to identify a set of reactant molecules $\mathcal{R} \in P(\mathcal{M})$ that can be used to synthesize the target $O$. Here $P(\mathcal{M})$ is the power set of all molecules $\mathcal{M}$.

3 Conditional Graph Logic Network

Let $\mathbb{1}[m \subseteq M] : \mathcal{F} \times \mathcal{M} \mapsto \{0, 1\}$ be the predicate that indicates whether subgraph pattern $m$ is a subgraph inside molecule $M$. This can be checked via subgraph matching. Then the use of a retrosynthetic template $T : o_T \rightarrow r_T^1 + r_T^2 + \ldots + r_T^{N(T)}$ for reasoning about a reaction can be decomposed into two-step logic. First,

$$\text{I. Match template: } \phi_O(T) := \mathbb{1}[o_T \subseteq O] \cdot \mathbb{1}[T \in \mathcal{T}], \qquad (2)$$

where the subgraph pattern $o_T$ from the reaction template $T$ is matched against the product $O$, i.e., $o_T$ is a subgraph of the product $O$. Second,

$$\text{II. Match reactants: } \phi_{O,T}(\mathcal{R}) := \phi_O(T) \cdot \mathbb{1}[|\mathcal{R}| = N(T)] \cdot \prod_{i=1}^{N(T)} \mathbb{1}[r_T^i \subseteq R_{\pi(i)}], \qquad (3)$$

where the set of subgraph patterns $\{r_T^1, \ldots, r_T^{N(T)}\}$ from the reaction template is matched against the set of reactants $\mathcal{R}$. The logic is that the size of the reactant set $\mathcal{R}$ has to match the number of patterns in the reaction template $T$, and there must exist a permutation $\pi(\cdot)$ of the elements in $\mathcal{R}$ such that each reactant matches a corresponding subgraph pattern in the template.

Since there will still be uncertainty in whether the reaction is chemically possible even when the template matches, we want to capture such uncertainty by allowing each template (or logic reasoning rule) to have a different confidence score. More specifically, we will use a template score function $w_1(T, O)$ given the product $O$, and a reactant score function $w_2(\mathcal{R}, T, O)$ given the template $T$ and the product $O$. Thus the overall probabilistic models for the reaction template $T$ and the set of molecules $\mathcal{R}$ are designed as

$$\text{I. Match template: } p(T|O) \propto \exp(w_1(T, O)) \cdot \phi_O(T), \qquad (4)$$
$$\text{II. Match reactants: } p(\mathcal{R}|T, O) \propto \exp(w_2(\mathcal{R}, T, O)) \cdot \phi_{O,T}(\mathcal{R}). \qquad (5)$$

Given the above two-step probabilistic reasoning models, the joint probability of a single-step retrosynthetic proposal using reaction template $T$ and reactant set $\mathcal{R}$ can be written as

$$p(\mathcal{R}, T|O) \propto \exp(w_1(T, O) + w_2(\mathcal{R}, T, O)) \cdot \phi_O(T)\, \phi_{O,T}(\mathcal{R}). \qquad (6)$$

In this energy-based model, whether the graphical model (GM) is directed or undirected is a design choice. We will present our directed GM design and the corresponding partition function in Sec 4 shortly. We name our model the Conditional Graph Logic Network (GLN) (Fig. 
2), as it is a conditional graphical model defined with logic rules, where the logic variables are graph structures (i.e., molecules, subgraph patterns, etc.). In this model, we assume that satisfying the templates is a necessary condition for the retrosynthesis, i.e., $p(\mathcal{R}, T|O) \neq 0$ only if $\phi_O(T)$ and $\phi_{O,T}(\mathcal{R})$ are nonzero. Such a restriction introduces sparse structure into the model, and makes this abductive type of reasoning feasible.

3Commonly encoded using SMARTS/SMIRKS patterns

[Figure 2 here; the overlay labels did not survive extraction. The arrows in the figure are annotated with the probabilities $p(T|O)$, its decomposition, and $p(\mathcal{R}|O, T)$.]

Figure 2: Retrosynthesis pipeline with GLN. The three dashed boxes from top to bottom represent the set of templates $\mathcal{T}$, subgraphs $\mathcal{F}$ and molecules $\mathcal{M}$. Different colors represent retrosynthesis routes with different templates. The dashed lines represent potentially possible routes that are not observed. Reaction centers in products $O$ are highlighted.

Reaction type conditional model: In some situations when performing the retrosynthetic analysis, the human expert may already have a certain type $c$ of reaction in mind. In this case, our model can easily be adapted to incorporate this as well:

$$p(\mathcal{R}, T|O, c) \propto \exp(w_1(T, O) + w_2(\mathcal{R}, T, O)) \cdot \phi_O(T)\, \phi_{O,T}(\mathcal{R})\, \mathbb{1}[T \in \mathcal{T}_c], \qquad (7)$$

where $\mathcal{T}_c$ is the set of retrosynthesis templates that belong to reaction type $c$.

GLN is related to but significantly different from the Markov Logic Network (MLN, which also uses a graphical model to capture uncertainty in logic rules). MLN treats the predicates of logic rules as latent variables, and the inference task is to obtain their posterior. In GLN, by contrast, the task is structured prediction, and the predicates are implemented with subgraph matching. We show more details on this connection in Appendix A.

4 Model Design

Although the model we have defined so far has some nice properties, the design of its components plays a critical role in capturing the uncertainty in retrosynthesis. 
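As a toy numerical sketch of the two-step model in Eqs. (4)-(6): hard logic indicators restrict the support, while soft scores rank within it. All helper names are hypothetical, a substring test stands in for the subgraph-matching predicate, and a constant stands in for the learned score $w_1$:

```python
import math
from typing import Callable, Dict

def softmax_over_support(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize exp(score) over the items that passed the hard indicator."""
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}

def p_template_given_product(
    product: str,
    templates: Dict[str, str],            # template id -> outcome pattern o_T
    w1: Callable[[str, str], float],      # score w1(T, O)
    matches: Callable[[str, str], bool],  # stand-in for the predicate 1[o_T ⊆ O]
) -> Dict[str, float]:
    """p(T|O) ∝ exp(w1(T,O)) · φ_O(T): templates that fail the match get zero mass."""
    support = {t: w1(t, product) for t, o in templates.items() if matches(o, product)}
    return softmax_over_support(support)

# toy example: substring containment plays the role of subgraph isomorphism
templates = {"T1": "C(=O)O", "T2": "N=N"}
probs = p_template_given_product(
    "CC(=O)OC", templates,
    w1=lambda t, o: 1.0,                  # uniform scores
    matches=lambda pat, mol: pat in mol,  # toy predicate
)
assert probs == {"T1": 1.0}  # only T1 matches, so it gets all probability mass
```

The same masked-softmax shape applies to $p(\mathcal{R}|T, O)$ in Eq. (5), with reactant sets in place of templates.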
We first describe a decomposable design of $p(T|O)$ in Sec. 4.1, for learning and sampling efficiency considerations; then in Sec. 4.2 we describe the parameterization of the scoring functions $w_1$, $w_2$ in detail.

4.1 Decomposable design of p(T|O)

Depending on how specific the reaction rules are, the template set $\mathcal{T}$ could in the extreme case be as large as the total number of reactions. Thus directly modeling $p(T|O)$ can lead to difficulties in learning and inference. Revisiting the logic rule defined in Eq. (2), we can see that the subgraph pattern $o_T$ plays a critical role in choosing the template. Since we represent the templates as $T = (o_T \rightarrow \{r_T^i\}_{i=1}^{N(T)})$, it is natural to decompose the energy function $w_1(T, O)$ in Eq. (4) as $w_1(T, O) = v_1(o_T, O) + v_2(\{r_T^i\}_{i=1}^{N(T)}, O)$. Meanwhile, the template matching rule is also decomposable, so we obtain the resulting template probability model as

$$p(T|O) = p(o_T, \{r_T^i\}_{i=1}^{N(T)} \mid O) = \frac{1}{Z(O)} \exp(v_1(o_T, O)) \cdot \mathbb{1}[o_T \subseteq O] \cdot \exp\big(v_2(\{r_T^i\}_{i=1}^{N(T)}, O)\big) \cdot \mathbb{1}[(o_T \rightarrow \{r_T^i\}_{i=1}^{N(T)}) \in \mathcal{T}], \qquad (8)$$

where the partition function $Z(O)$ is defined as

$$Z(O) = \sum_{o \in \mathcal{F}} \exp(v_1(o, O)) \cdot \mathbb{1}[o \subseteq O] \cdot \Big( \sum_{\{r\} \in P(\mathcal{F})} \exp(v_2(\{r\}, O)) \cdot \mathbb{1}[(o \rightarrow \{r\}) \in \mathcal{T}] \Big). \qquad (9)$$

Here we abuse notation slightly to denote a set of subgraph patterns as $\{r\}$.

With such a decomposition, we can further speed up both training and inference for $p(T|O)$, since the number of valid reaction centers per molecule and the number of templates per reaction center are much smaller than the total number of templates. Specifically, we can sample $T \sim p(T|O)$ by first sampling a reaction center $p(o|O) \propto \exp(v_1(o, O)) \cdot \mathbb{1}[o \subseteq O]$ and then choosing the subgraph patterns for the reactants, $p(\{r\}|O, o) \propto \exp(v_2(\{r\}, O)) \cdot \mathbb{1}[(o \rightarrow \{r\}) \in \mathcal{T}]$. In the end we obtain the template represented as $(o \rightarrow \{r\})$.

In the literature there have been several attempts to model and learn $p(T|O)$, e.g., multi-class classification [9] or a multiscale model with a human-defined template hierarchy [14]. The proposed decomposable design follows the template specification naturally, and thus admits a nice graph-structured parameterization and interpretation, as will be covered in the next subsection.

Finally, the directed graphical model design of Eq. (6) is written as

$$p(\mathcal{R}, T|O) = \frac{1}{Z(O)\, Z(T, O)} \exp\Big( v_1(o_T, O) + v_2(\{r_T^i\}_{i=1}^{N(T)}, O) + w_2(\mathcal{R}, T, O) \Big) \cdot \phi_O(T)\, \phi_{O,T}(\mathcal{R}), \qquad (10)$$

where $Z(T, O) = \sum_{\mathcal{R} \in P(\mathcal{M})} \exp(w_2(\mathcal{R}, T, O)) \cdot \phi_{O,T}(\mathcal{R})$ sums over all subsets of molecules.

4.2 Graph Neuralization for v1, v2 and w2

Since the arguments of the energy functions $w_1$, $w_2$ are molecules, which can be represented by graphs, one natural choice is to design the parameterization based on recent advances in graph neural networks (GNNs) [17, 18, 19, 20, 21, 22]. Here we first present a brief review of the general form of GNNs, and then explain how we utilize them to design the energy functions.

The graph embedding is a function $g : \mathcal{M} \cup \mathcal{F} \mapsto \mathbb{R}^d$ that maps a graph into a $d$-dimensional vector. We denote $G = (V^G, E^G)$ as the graph representation of some molecule or subgraph pattern, where $V^G = \{v_i\}_{i=1}^{|V^G|}$ is the set of atoms (nodes) and $E^G = \{e_i = (e_i^1, e_i^2)\}_{i=1}^{|E^G|}$ is the set of bonds (edges). We represent each undirected bond as two directional edges. Generally, the embedding of the graph is computed through node embeddings $h_{v_i}$ that are computed in an iterative fashion. Specifically, let $h_{v_i}^0 = x_{v_i}$ initially, where $x_{v_i}$ is a vector of node features, like the atomic number, aromaticity, etc. of the corresponding atom. Then the following update operator is applied recursively:

$$h_v^{l+1} = F\big(x_v, \{(h_u^l, x_{u \rightarrow v})\}_{u \in \mathcal{N}(v)}\big), \qquad (11)$$

where $x_{u \rightarrow v}$ is the feature of edge $u \rightarrow v$. This procedure repeats for $L$ steps. While there are many design choices for the so-called message passing operator $F$, we use structure2vec [21] due to its simplicity and efficient C++ binding with RDKit. Finally we have the parameterization

$$h_v^{l+1} = \sigma\Big( \theta_1 x_v + \theta_2 \sum_{u \in \mathcal{N}(v)} h_u^l + \theta_3 \sum_{u \in \mathcal{N}(v)} \sigma(\theta_4 x_{u \rightarrow v}) \Big), \qquad (12)$$

where $\sigma(\cdot)$ is some nonlinear activation function, e.g., relu or tanh, and $\theta = \{\theta_1, \ldots, \theta_4\}$ are the learnable parameters. Let the node embedding $h_v = h_v^L$ be the last output of $F$; then the final graph embedding is obtained by averaging over node embeddings: $g(G) = \frac{1}{|V^G|} \sum_{v \in V^G} h_v$. Note that attention [23] or other order-invariant aggregation can also be used here.

With this background on GNNs, we introduce the concrete parameterization of each component:
• Parameterizing $v_1$: Given a molecule $O$, $v_1$ can be viewed as a scoring function over possible reaction centers inside $O$. Since the subgraph pattern $o$ is also a graph, we parameterize $v_1$ with an inner product, i.e., $v_1(o, O) = g_1(o)^\top g_2(O)$. Such a form can be treated as computing the compatibility between $o$ and $O$. Note that due to our design choice, $v_1(o, O)$ can be written as $v_1(o, O) = \sum_{v \in V^O} h_v^\top g_1(o)$. This form allows us to see the contribution to the compatibility from each atom in $O$.
• Parameterizing $v_2$: The size of the set of subgraph patterns $\{r_T^i\}_{i=1}^{N(T)}$ varies for different templates $T$. Inspired by DeepSet [24], we use average pooling over the embeddings of each subgraph pattern to represent this set. Specifically,

$$v_2\big(\{r_T^i\}_{i=1}^{N(T)}, O\big) = g_3(O)^\top \Big( \frac{1}{N(T)} \sum_{i=1}^{N(T)} g_4(r_T^i) \Big). \qquad (13)$$

• Parameterizing $w_2$: This energy function also needs to take a set as input. 
Following the same design as $v_2$, we have

$$w_2(\mathcal{R}, T, O) = g_5(O)^\top \Big( \frac{1}{|\mathcal{R}|} \sum_{R \in \mathcal{R}} g_6(R) \Big). \qquad (14)$$

Note that our GLN framework is not limited to the specific parameterization above and is compatible with other parameterizations. For example, one can use the condensed graph of reaction [25] to represent $\mathcal{R}$ as a single graph. Other chemistry-specialized GNNs [3, 26] can also easily be applied here. For an ablation study on these design choices, please refer to Appendix C.1.

5 MLE with Efficient Inference

Given a dataset $\mathcal{D} = \{(O_i, T_i, \mathcal{R}_i)\}_{i=1}^{|\mathcal{D}|}$ with $|\mathcal{D}|$ reactions, we denote the parameters in $w_1(T, O)$, $w_2(T, \mathcal{R}, O)$ as $\Theta = (\theta_1, \theta_2)$, respectively. Maximum log-likelihood estimation (MLE) is a natural choice for parameter estimation. Since $\forall (O, T, \mathcal{R}) \sim \mathcal{D}$, $\phi_O(T) = 1$ and $\phi_{O,T}(\mathcal{R}) = 1$, we have the MLE optimization

$$\max_\Theta \ \ell(\Theta) := \widehat{\mathbb{E}}_\mathcal{D}[\log p(\mathcal{R}|T, O)\, p(T|O)] = \widehat{\mathbb{E}}_\mathcal{D}[w_1(T, O) + w_2(\mathcal{R}, T, O) - \log Z(O) - \log Z(O, T)]. \qquad (15)$$

The gradient of $\ell(\Theta)$ w.r.t. $\Theta$ can be derived 4 as

$$\nabla_\Theta \ell(\Theta) = \widehat{\mathbb{E}}_\mathcal{D}[\nabla_\Theta w_1(T, O)] - \widehat{\mathbb{E}}_O \mathbb{E}_{T|O}[\nabla_\Theta w_1(T, O)] + \widehat{\mathbb{E}}_\mathcal{D}[\nabla_\Theta w_2(\mathcal{R}, T, O)] - \widehat{\mathbb{E}}_{O,T} \mathbb{E}_{\mathcal{R}|T,O}[\nabla_\Theta w_2(\mathcal{R}, T, O)], \qquad (16)$$

where $\mathbb{E}_{T|O}[\cdot]$ and $\mathbb{E}_{\mathcal{R}|O,T}[\cdot]$ stand for expectations w.r.t. the current model $p(T|O)$ and $p(\mathcal{R}, T|O)$, respectively. With the gradient estimator (16), we can apply stochastic gradient descent (SGD) to optimize (15).

Efficient inference for gradient approximation: Since $\mathcal{R} \in P(\mathcal{M})$ is a combinatorial space, generally an expensive MCMC algorithm would be required for sampling from $p(\mathcal{R}|T, O)$ to approximate (16). However, this can be largely accelerated by scrutinizing the logic properties of the proposed model. Recall that the matching between template and reactants is by design a necessary condition for $p(\mathcal{R}, T|O) > 0$. On the other hand, given $O$, only a few templates $T$ with reactants $\mathcal{R}$ have nonzero $\phi_O(T)$ and $\phi_{O,T}(\mathcal{R})$. 
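For intuition, this sparsity is what makes sampling cheap: only the matched support needs to be enumerated. A toy sketch (names hypothetical; a substring test again stands in for the subgraph isomorphism check that the real system performs with RDKit):

```python
import random
from typing import Dict, List

def matched_templates(product: str, templates: Dict[str, str]) -> List[str]:
    """T_O = {T : φ_O(T) ≠ 0}: templates whose outcome pattern matches the product.
    A substring test stands in for subgraph matching in this toy sketch."""
    return [t for t, o in templates.items() if o in product]

def sample_support_uniform(product: str, templates: Dict[str, str],
                           rng: random.Random) -> str:
    """Uniform proposal over the restricted support: no neural-network
    forward pass is needed just to draw a negative sample."""
    support = matched_templates(product, templates)
    return rng.choice(support)

templates = {"T1": "C(=O)O", "T2": "N=N", "T3": "C(=O)"}
support = matched_templates("CC(=O)OC", templates)
assert support == ["T1", "T3"]  # T2's pattern does not occur in the product
assert sample_support_uniform("CC(=O)OC", templates, random.Random(0)) in support
```

In the real model the support per product is small (on the order of tens of templates, as quantified below), so this enumeration is a manageable constant cost.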
Then, we can sample $T$ and $\mathcal{R}$ by importance sampling over the restricted support of templates instead of MCMC over $P(\mathcal{M})$. Rigorously, given $O$, we denote the matched templates as $\mathcal{T}_O$ and the matched reactants based on $T$ as $\mathcal{R}_{T,O}$, where

$$\mathcal{T}_O = \{T \in \mathcal{T} : \phi_O(T) \neq 0\} \quad \text{and} \quad \mathcal{R}_{T,O} = \{\mathcal{R} \in P(\mathcal{M}) : \phi_{O,T}(\mathcal{R}) \neq 0\}. \qquad (17)$$

The importance sampling then leads to an unbiased gradient approximation $\widehat{\nabla}_\Theta \ell(\Theta)$, as illustrated in Algorithm 1.

Algorithm 1: Importance Sampling for $\widehat{\nabla}_\Theta \ell(\Theta)$
1: Input $(\mathcal{R}, T, O) \sim \mathcal{D}$, $p(\mathcal{R}|T, O)$ and $p(T|O)$.
2: Construct $\mathcal{T}_O$ according to $\phi_O(T)$.
3: Sample $\tilde{T} \propto \exp(w_1(T, O)), \forall T \in \mathcal{T}_O$, in a hierarchical way, as in Sec. 4.1.
4: Construct $\mathcal{R}_{T,O}$ according to $\phi_{O,T}(\mathcal{R})$.
5: Sample $\tilde{\mathcal{R}} \propto \exp(w_2(\mathcal{R}, T, O))$.
6: Compute the stochastic approximation $\widehat{\nabla}_\Theta \ell(\Theta)$ with the sample $(\mathcal{R}, T, \tilde{\mathcal{R}}, \tilde{T}, O)$ by (16).

To make the algorithm more efficient in practice, we have adopted the following accelerations: 1) decomposable modeling of $p(T|O)$, as described in Sec. 4.1; 2) caching the computed $\mathcal{T}_O$ and $\mathcal{R}_{T,O}$ in advance.

In a dataset with $5 \times 10^4$ reactions, $|\mathcal{T}_O|$ is about 80 and $|\mathcal{R}_{T,O}|$ is roughly 10 on average. Therefore, we reduce the actual computational cost to a manageable constant. We further reduce the cost of sampling by generating $\tilde{T}$ and $\tilde{\mathcal{R}}$ uniformly from the support. Although these samples only cover the support of the model, we avoid computing the forward pass of the neural networks, achieving better computational complexity. In our experiments, such an approximation already achieves state-of-the-art results. We would expect recent advances in energy-based models to further boost the performance, which we leave as future work to investigate.

Remark on $\mathcal{R}_{T,O}$: Note that to get all possible sets of reactants that match the reaction template $T$ and product $O$, we can efficiently use graph edit tools without limiting the reactants to those known in the dataset. This procedure works as follows: given a template $T = o_T \rightarrow r_T^1 + \ldots + r_T^N$,
1) Enumerate all matches between the subgraph pattern $o_T$ and the target product $O$.
2) Instantiate a copy of the reactant atoms according to $r_T^1, \ldots, r_T^N$ for each match.
3) Copy over all of the connected atoms and atom properties from $O$.

4We adopt the convention $0 \log 0 = 0$ [27], which is justified by continuity since $x \log x \rightarrow 0$ as $x \rightarrow 0$.

This process is a routine in most cheminformatics packages. In our paper we use runReactants from RDKit, with improved stereochemistry handling 5, to realize this.

Further acceleration via beam search: Given a product $O$, prediction involves finding the pair $(\mathcal{R}, T)$ that maximizes $p(\mathcal{R}, T|O)$. One possibility is to first enumerate $T \in \mathcal{T}_O$ and then $\mathcal{R} \in \mathcal{R}_{T,O}$. This is acceptable by exploiting the sparse support property induced by the logic rules. A more efficient way is to use beam search with size $k$. First we find the $k$ reaction centers $\{o_i\}_{i=1}^k$ with top $v_1(o, O)$. Next, for each $o \in \{o_i\}_{i=1}^k$, we score the corresponding $v_2(\{r\}, O) \cdot \mathbb{1}[(o \rightarrow \{r\}) \in \mathcal{T}]$. In this stage the top $k$ pairs $\{(o_{T_j}, \{r_{T_j}^i\})\}_{j=1}^k$ (i.e., the templates) that maximize $v_1(o, O) + v_2(\{r\}, O)$ are kept. Finally, using these templates, we choose the best $\mathcal{R} \in \bigcup_{j=1}^k \mathcal{R}_{T_j, O}$ that maximizes the total score $w_1(T, O) + w_2(\mathcal{R}, T, O)$. Fig. 2 provides a visual explanation.

6 Experiment

Dataset: We mainly evaluate our method on a benchmark dataset named USPTO-50k, which contains 50k reactions of 10 different types from the US patent literature. We use exactly the same training/validation/test splits as Coley et al. [8], which contain 80%/10%/10% of the total 50k reactions. Table 1 contains detailed information about the benchmark. 
Additionally, we also build a dataset from the entire USPTO 1976-2016 collection to verify the scalability of our method.

Baselines: The baseline algorithms include rule-based methods, neural network-based methods, and combinations of both. The expertSys is an expert system based on retrosynthetic reaction rules, where a rule is selected according to the popularity of the corresponding reaction type. The seq2seq [10] and transformer [11] are neural sequence-to-sequence learning models [28] implemented with LSTM [29] or Transformer [30] architectures. These models encode the canonicalized SMILES representation of the target compound as input, and directly output canonical SMILES of the reactants. We also include data-driven template-based models. The retrosim [8] uses direct calculation of molecular similarities to rank the rules and resulting reactants. The neuralsym [9] models $p(T|O)$ as multi-class classification using an MLP. All results except neuralsym are taken from the original reports, since we use the same experimental setting. Since neuralsym is not open-source, we reimplemented it using their best reported ELU512 model with the same method for parameter tuning.

Evaluation metric: The evaluation metric we use is the top-k exact match accuracy, which is commonly used in the literature. This metric checks whether the predicted set of reactants is exactly the same as the ground truth reactants. The comparison is performed between canonical SMILES strings generated by RDKit.

Setup of GLN: We use rdchiral [31] to extract the retrosynthesis templates from the training set. After removing duplicates, we obtained 11,647 unique template rules in total for USPTO-50k. These rules cover 93.3% of the test set. That is to say, for each test instance we try to apply these rules and check whether any of them gives an exact match. 
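The exact-match comparison used for both this coverage check and the main evaluation can be sketched as follows. The `canonical` argument abstracts the RDKit canonicalization step; a trivial stand-in is used here, and all names are our own illustration:

```python
from typing import Callable, List

def topk_exact_match(
    predictions: List[List[str]],  # ranked candidate reactant sets per test case
    ground_truth: List[str],       # one reference reactant set per test case
    k: int,
    canonical: Callable[[str], str] = str.strip,  # RDKit canonical SMILES in practice
) -> float:
    """Fraction of test cases whose ground-truth reactant set appears among
    the top-k candidates. Reactant sets are '.'-separated SMILES, compared
    as unordered sets of canonicalized components."""
    def key(s: str) -> frozenset:
        return frozenset(canonical(part) for part in s.split("."))
    hits = sum(
        key(gt) in {key(cand) for cand in cands[:k]}
        for cands, gt in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)

preds = [["CCO.CC(=O)O", "CCBr"], ["c1ccccc1"]]
truth = ["CC(=O)O.CCO", "CCN"]
assert topk_exact_match(preds, truth, k=1) == 0.5  # first case matches as a set
```

Comparing frozensets makes the check insensitive to reactant ordering, mirroring the fact that a reactant set has no canonical order.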
Thus this is the theoretical upper bound of a rule-based approach using this particular degree of template specificity, which is high enough for our purposes. For more statistics about these rules, please refer to Table 2.

We train our model for up to 150k updates with a batch size of 64. It takes about 12 hours to train with a single GTX 1080Ti GPU. We tune the embedding size in {128, 256}, the number of GNN layers in {3, 4, 5}, and the GNN aggregation in {max, mean, sum} using the validation set. Our code is released at https://github.com/Hanjun-Dai/GLN. More details are included in Appendix B.

6.1 Main results

We present the top-k exact match accuracy in Table 3, where k ranges over {1, 3, 5, 10, 20, 50}. We evaluate both the reaction-class-unknown and class-conditional settings. Using the reaction class as prior knowledge reflects situations where chemists already have an idea of how they would like to synthesize the product.

In all settings, our proposed GLN outperforms the baseline algorithms. In particular, for top-1 accuracy our model performs significantly better than the second best method: 8.1% higher accuracy with unknown reaction class, and 8.9% higher with the reaction class given. This demonstrates the advantage of our method in this difficult setting and its potential applicability in practice.

5https://github.com/connorcoley/rdchiral.

Table 1: Dataset information (USPTO-50k).
# train: 40,008 | # val: 5,001 | # test: 5,007 | # rules: 11,647 | # reaction types: 10

Table 2: Reaction and template set information.
Rule coverage: 93.3% | # unique centers: 9,078 | Avg. # centers per mol: 29.31 | Avg. # rules per mol: 83.85 | Avg. # reactants: 1.71

Table 3: Top-k exact match accuracy (%).

Reaction class unknown:
method           | top-1 | top-3 | top-5 | top-10 | top-20 | top-50
transformer [11] | 37.9  | 57.3  | 62.7  |   /    |   /    |   /
retrosim [8]     | 37.3  | 54.7  | 63.3  |  74.1  |  82.0  |  85.3
neuralsym [9]    | 44.4  | 65.3  | 72.4  |  78.9  |  82.2  |  83.1
GLN              | 52.5  | 69.0  | 75.6  |  83.7  |  89.0  |  92.4

Reaction class given as prior:
method           | top-1 | top-3 | top-5 | top-10 | top-20 | top-50
expertSys [10]   | 35.4  | 52.3  | 59.1  |  65.1  |  68.6  |  69.5
seq2seq [10]     | 37.4  | 52.4  | 57.0  |  61.7  |  65.9  |  70.7
retrosim [8]     | 52.9  | 73.8  | 81.2  |  88.1  |  91.8  |  92.9
neuralsym [9]    | 55.3  | 76.0  | 81.4  |  85.1  |  86.5  |  86.9
GLN              | 64.2  | 79.1  | 85.2  |  90.0  |  92.3  |  93.2

[Figures 3 and 4 here; the molecular drawings did not survive extraction. Each figure shows the ground truth reactants and the top-3 predicted reactant sets for a product, with reaction centers highlighted and Dice similarities (e.g., 0.9, 0.87, 0.82) annotated for non-matching predictions.]

Figure 3: Example successful predictions.
Figure 4: Example failed predictions.

Moreover, our performance in the reaction-class-unknown setting even outperforms expertSys and seq2seq in the reaction-conditional setting. Since the transformer paper did not report top-k performance for k > 10, we leave those entries blank. Meanwhile, Karpov et al. [11] also report results when training on the training+validation set and tuning on the test set. Even with this extra privilege, the top-1 accuracy of the transformer is 42.7%, which is still worse than ours. This shows the benefit of our logic-powered deep neural network model compared with purely neural models, especially when the amount of data is limited.

Since the theoretical upper bound of this rule-based implementation is 93.3%, the top-50 accuracy of our method in each setting is quite close to this limit. This shows that the probabilistic model we built matches the actual retrosynthesis target well.

6.2 Interpreting the predictions

Visualizing the predicted synthesis: In Figs 3 and 4, we visualize the ground truth reaction and the top-3 predicted reactions (see Appendix C.6 for high-resolution figures). For each reaction, we also highlight the corresponding reaction cores (i.e., the set of atoms that get changed). This is done by matching the subgraphs from the predicted retrosynthesis template with the target compound and generated reactants, respectively. 
Fig 3 shows that our correct predictions recover almost the same reaction cores as the ground truth. In this particular case, the explanation of our prediction aligns with existing reaction knowledge.
Fig 4 shows a failure mode where none of the top-3 predictions matches. In this case we calculated the similarity between the predicted reactants and the ground truth ones using the Dice similarity from RDKit. We find that they are still similar at the molecular fingerprint level, which suggests that these predictions could be valid alternatives that are simply not reported in the literature.

Figure 5: Reaction center prediction visualization (rows: molecules and center patterns; columns: top-1 prediction, bottom-1 prediction, true reaction core). Red atoms indicate positive match scores, while blue ones have negative scores. The darkness of the color shows the magnitude of the score. Green parts highlight the substructure match between molecules and center structures.

Visualizing the reaction center prediction: Here we visualize the prediction from the probabilistic modeling of reaction centers. This is done by calculating the inner product of each atom embedding in the target molecule with the subgraph pattern embedding. Fig 5 shows the visualization of scores on the atoms that are part of the reaction center. The top-1 prediction assigns positive scores to these atoms (red), while the bottom-1 prediction (i.e., the prediction with the least probability) assigns large negative scores (blue). Note that although the reaction center in the molecule and the corresponding subgraph pattern have the same structure, the matching scores differ a lot. This suggests that the model has learned to predict the activity of substructures inside molecule graphs.

6.3 Study of the performance
In addition to the overall numbers in Table 3, we provide a detailed study of the performance.
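As a side note on the Fig 4 analysis above: the Dice similarity over molecular fingerprints reduces to set arithmetic on the on-bit indices, 2|A∩B|/(|A|+|B|). Below is a minimal pure-Python stand-in for what RDKit's DataStructs.DiceSimilarity computes (the fingerprinting step itself is omitted):

```python
def dice_similarity(onbits_a, onbits_b):
    """Dice coefficient between two fingerprints, each given as the set of
    on-bit indices: 2*|A & B| / (|A| + |B|), ranging over [0, 1]."""
    if not onbits_a and not onbits_b:
        return 1.0  # convention: two empty fingerprints count as identical
    return 2.0 * len(onbits_a & onbits_b) / (len(onbits_a) + len(onbits_b))
```

Two reactant sets sharing most fingerprint bits, like the high-similarity predictions in Fig 4, are thus close at the substructure level even when the exact-match check fails.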
This includes per-category performance, the accuracy of each module in hierarchical sampling, and the effect of the beam size. Due to the space limit, please refer to Appendix C.

Table 4: Top-k accuracy on USPTO-full.
  method      top-1  top-10
  retrosim     32.8   56.1
  neuralsym    35.8   60.8
  GLN          39.3   63.7

6.4 Large scale experiments on USPTO-full
To see how this method scales with the dataset size, we create a large dataset from the entire set of reactions from USPTO 1976-2016. There are 1,808,937 raw reactions in total. For reactions with multiple products, we duplicate them into multiple reactions with one product each. After removing the duplicates and the reactions with wrong atom mappings, we obtain roughly 1M unique reactions, which are further divided into train/valid/test sets of size 800k/100k/100k.
We train on a single GPU for 3 days and report the model with the best validation accuracy. The results are presented in Table 4. We compare with the best two baselines from the previous sections. Despite the noisiness of the full USPTO set relative to the clean USPTO-50k, our method still outperforms the two best baselines in top-k accuracy.

7 Discussion
Evaluation: Retrosynthesis usually does not have a single right answer. Evaluation in this work is to reproduce what is reported for single-step retrosynthesis. This is a good, but imperfect, benchmark, since there are potentially many reasonable ways to synthesize a single product.
Limitations: We share the limitations of all template-based methods. In our method, the template designs, more specifically their specificities, remain a design art and are hard to decide beforehand. Also, scalability is still an issue, since we rely on subgraph isomorphism during preprocessing.
Future work: The subgraph isomorphism part can potentially be replaced with a predictive model, while during inference fast inner product search [32] can be used to reduce the computation cost. Also
Also\nactively building templates or even inducing new ones could enhance the capacity and robustness.\n\nAcknowledgments\nWe would like to thank anonymous reviewers for providing constructive feedbacks. This project\nwas supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351,\nCAREER IIS-1350983 to L.S.\n\n9\n\n\fReferences\n[1] Elias JAMES Corey. The logic of chemical synthesis: multistep synthesis of complex carbogenic molecules\n\n(nobel lecture). Angewandte Chemie International Edition in English, 30(5):455\u2013465, 1991.\n\n[2] Connor W. Coley, William H. Green, and Klavs F. Jensen. Machine learning in computer-aided synthesis\n\nplanning. 51(5):1281\u20131289, . doi: 10.1021/acs.accounts.8b00087.\n\n[3] Wengong Jin, Connor Coley, Regina Barzilay, and Tommi Jaakkola. Predicting organic reaction outcomes\nwith weisfeiler-lehman network. In Advances in Neural Information Processing Systems, pages 2607\u20132616,\n2017.\n\n[4] Connor W. Coley, Wengong Jin, Luke Rogers, Timothy F. Jamison, Tommi S. Jaakkola, William H. Green,\nRegina Barzilay, and Klavs F. Jensen. A graph-convolutional neural network model for the prediction of\nchemical reactivity. 10(2):370\u2013377, . doi: 10.1039/C8SC04228D.\n\n[5] John Bradshaw, Matt J Kusner, Brooks Paige, Marwin HS Segler, and Jos\u00e9 Miguel Hern\u00e1ndez-Lobato. A\n\ngenerative model for electron paths. 2018.\n\n[6] EJ Corey and W Todd Wipke. Computer-assisted design of complex organic syntheses. Science, 166\n\n(3902):178\u2013192, 1969.\n\n[7] Sara Szymkuc, Ewa P. Gajewska, Tomasz Klucznik, Karol Molga, Piotr Dittwald, Micha\u0142 Startek, Micha\u0142\nBajczyk, and Bartosz A. Grzybowski. Computer-assisted synthetic planning: The end of the beginning. 55\n(20):5904\u20135937. doi: 10.1002/anie.201506101.\n\n[8] Connor W Coley, Luke Rogers, William H Green, and Klavs F Jensen. Computer-assisted retrosynthesis\n\nbased on molecular similarity. 
ACS Central Science, 3(12):1237-1245, 2017.
[9] Marwin H. S. Segler and Mark P. Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction. Chemistry: A European Journal, 23(25):5966-5971, 2017. doi: 10.1002/chem.201605499.
[10] Bowen Liu, Bharath Ramsundar, Prasad Kawthekar, Jade Shi, Joseph Gomes, Quang Luu Nguyen, Stephen Ho, Jack Sloane, Paul Wender, and Vijay Pande. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Central Science, 3(10):1103-1113, 2017.
[11] Pavel Karpov, Guillaume Godin, and Igor Tetko. A transformer model for retrosynthesis. 2019.
[12] Philippe Schwaller, Theophile Gaudin, David Lanyi, Costas Bekas, and Teodoro Laino. "Found in translation": predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chemical Science, 9(28):6091-6098, 2018.
[13] Philippe Schwaller, Teodoro Laino, Theophile Gaudin, Peter Bolgar, Costas Bekas, and Alpha A. Lee. Molecular transformer for chemical reaction prediction and uncertainty estimation. doi: 10.26434/chemrxiv.7297379.v1.
[14] Javier L. Baylon, Nicholas A. Cilfone, Jeffrey R. Gulcher, and Thomas W. Chittenden. Enhancing retrosynthetic reaction prediction with deep learning using multiscale reaction classification. Journal of Chemical Information and Modeling, 59(2):673-688, 2019.
[15] Marwin H. S. Segler, Mike Preuss, and Mark P. Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555(7698):604-610, 2018.
[16] John S. Schreck, Connor W. Coley, and Kyle J. M. Bishop. Learning retrosynthetic planning through self-play. arXiv preprint arXiv:1901.06569, 2019.
[17] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61-80, 2008.
[18] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams.
Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224-2232, 2015.
[19] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning, pages 2024-2033. JMLR.org, 2017.
[20] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
[21] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702-2711, 2016.
[22] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024-1034, 2017.
[23] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
[24] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan R. Salakhutdinov, and Alexander J. Smola. Deep sets. In Advances in Neural Information Processing Systems, pages 3391-3401, 2017.
[25] Frank Hoonakker, Nicolas Lachiche, Alexandre Varnek, and Alain Wagner. Condensed graph of reaction: considering a chemical reaction as one single pseudo molecule.
[26] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, pages 1263-1272. JMLR.org, 2017.
[27] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[28] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.
Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.
[29] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.
[31] Connor W. Coley, William H. Green, and Klavs F. Jensen. RDChiral: An RDKit wrapper for handling stereochemistry in retrosynthetic template extraction and application. Journal of Chemical Information and Modeling, 2019.
[32] Ruiqi Guo, Sanjiv Kumar, Krzysztof Choromanski, and David Simcha. Quantization based fast inner product search. In Artificial Intelligence and Statistics, pages 482-490, 2016.
[33] Matthew Richardson and Pedro Domingos. Markov logic networks. Machine Learning, 62(1-2):107-136, 2006.
[34] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[35] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.