{"title": "Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network", "book": "Advances in Neural Information Processing Systems", "page_first": 2607, "page_last": 2616, "abstract": "The prediction of organic reaction outcomes is a fundamental problem in computational chemistry. Since a reaction may involve hundreds of atoms, fully exploring the space of possible transformations is intractable. The current solution utilizes reaction templates to limit the space, but it suffers from coverage and efficiency issues. In this paper, we propose a template-free approach to efficiently explore the space of product molecules by first pinpointing the reaction center -- the set of nodes and edges where graph edits occur. Since only a small number of atoms contribute to reaction center, we can directly enumerate candidate products.  The generated candidates are scored by a Weisfeiler-Lehman Difference Network that models high-order interactions between changes occurring at nodes across the molecule. Our framework outperforms the top-performing template-based approach with a 10% margin, while running orders of magnitude faster. Finally, we demonstrate that the model accuracy rivals the performance of domain experts.", "full_text": "Predicting Organic Reaction Outcomes with\n\nWeisfeiler-Lehman Network\n\nWengong Jin\u2020 Connor W. Coley\u2021 Regina Barzilay\u2020 Tommi Jaakkola\u2020\n\n\u2020{wengong,regina,tommi}@csail.mit.edu, \u2021ccoley@mit.edu\n\n\u2020Computer Science and Arti\ufb01cial Intelligence Lab, MIT\n\n\u2021Department of Chemical Engineering, MIT\n\nAbstract\n\nThe prediction of organic reaction outcomes is a fundamental problem in computa-\ntional chemistry. Since a reaction may involve hundreds of atoms, fully exploring\nthe space of possible transformations is intractable. The current solution utilizes\nreaction templates to limit the space, but it suffers from coverage and ef\ufb01ciency\nissues. 
In this paper, we propose a template-free approach to ef\ufb01ciently explore the\nspace of product molecules by \ufb01rst pinpointing the reaction center \u2013 the set of nodes\nand edges where graph edits occur. Since only a small number of atoms contribute\nto the reaction center, we can directly enumerate candidate products. The generated\ncandidates are scored by a Weisfeiler-Lehman Difference Network that models\nhigh-order interactions between changes occurring at nodes across the molecule.\nOur framework outperforms the top-performing template-based approach by a\n10% margin, while running orders of magnitude faster. Finally, we demonstrate\nthat the model accuracy rivals the performance of domain experts.\n\n1\n\nIntroduction\n\nOne of the fundamental problems in organic chemistry is the prediction of which products form as\na result of a chemical reaction [16, 17]. While the products can be determined unambiguously for\nsimple reactions, it is a major challenge for many complex organic reactions. Indeed, experimentation\nremains the primary manner in which reaction outcomes are analyzed. This is time consuming,\nexpensive, and requires the help of an experienced chemist. The empirical approach is particularly\nlimiting for the goal of automatically designing ef\ufb01cient reaction sequences that produce speci\ufb01c\ntarget molecule(s), a problem known as chemical retrosynthesis [16, 17].\nViewing molecules as labeled graphs over atoms, we propose to formulate the reaction prediction\ntask as a graph transformation problem. A chemical reaction transforms input molecules (reactants)\ninto new molecules (products) by performing a set of graph edits over reactant molecules, adding\nnew edges and/or eliminating existing ones. Given that a typical reaction may involve more than 100\natoms, fully exploring all possible transformations is intractable. 
The computational challenge is\nhow to reduce the space of possible edits effectively, and how to select the product from among the\nresulting candidates.\nThe state-of-the-art solution is based on reaction templates (Figure 1). A reaction template speci\ufb01es a\nmolecular subgraph pattern to which it can be applied and the corresponding graph transformation.\nSince multiple templates can match a set of reactants, another model is trained to \ufb01lter candidate\nproducts using standard supervised approaches. The key drawbacks of this approach are coverage\nand scalability. A large number of templates is required to ensure that at least one can reconstitute the\ncorrect product. The templates are currently either hand-crafted by experts [7, 1, 15] or generated\nfrom reaction databases with heuristic algorithms [2, 11, 3]. For example, Coley et al. [3] extracts\n140K unique reaction templates from a database of 1 million reactions. Beyond coverage, applying a\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: An example reaction where the reaction center is (27,28), (7,27), and (8,27), highlighted in\ngreen. Here bond (27,28) is deleted and (7,27) and (8,27) are connected by aromatic bonds to form a\nnew ring. The corresponding reaction template consists of not only the reaction center, but nearby\nfunctional groups that explicitly specify the context.\n\ntemplate involves graph matching and this makes examining large numbers of templates prohibitively\nexpensive. The current approach is therefore limited to small datasets with limited types of reactions.\nIn this paper, we propose a template-free approach by learning to identify the reaction center, a small\nset of atoms/bonds that change from reactants to products. In our datasets, on average only 5.5%\nof the reactant molecules directly participate in the reaction. 
The small size of the reaction centers\ntogether with additional constraints on bond formations enables us to directly enumerate candidate\nproducts. Our forward-prediction approach is then divided into two key parts: (1) learning to identify\nreaction centers and (2) learning to rank the resulting enumerated candidate products.\nOur technical approach builds on neural embedding of the Weisfeiler-Lehman isomorphism test.\nWe incorporate a speci\ufb01c attention mechanism to identify reaction centers while leveraging distal\nchemical effects not accounted for in related convolutional representations [5, 4]. Moreover, we\npropose a novel Weisfeiler-Lehman Difference Network to learn to represent and ef\ufb01ciently rank\ncandidate transformations between reactants and products.\nWe evaluate our method on two datasets derived from the USPTO [13], and compare our methods\nto the current top performing system [3]. Our method achieves 83.9% and 77.9% accuracy on two\ndatasets, outperforming the baseline approach by 10%, while running 140 times faster. Finally, we\ndemonstrate that the model outperforms domain experts by a large margin.\n\n2 Related Work\n\nTemplate-based Approach Existing machine learning models for product prediction are mostly\nbuilt on reaction templates. These approaches differ in the way templates are speci\ufb01ed and in the\nway the \ufb01nal product is selected from multiple candidates. For instance, Wei et al. [18] learns to\nselect among 16 pre-speci\ufb01ed, hand-encoded templates, given \ufb01ngerprints of reactants and reagents.\nWhile this work was developed on a narrow range of chemical reaction types, it is among the \ufb01rst\nimplementations that demonstrates the potential of neural models for analyzing chemical reactions.\nMore recent work has demonstrated the power of neural methods on a broader set of reactions. For\ninstance, Segler and Waller [14] and Coley et al. 
[3] use a data-driven approach to obtain a large set of\ntemplates, and then employ a neural model to rank the candidates. The key difference between these\napproaches is the representation of the reaction. In Segler and Waller [14], molecules are represented\nbased on their Morgan \ufb01ngerprints, while Coley et al. [3] represents reactions by the features of\natoms and bonds in the reaction center. However, the template-based architecture limits both of these\nmethods in scaling up to larger datasets with more diversity.\nTemplate-free Approach Kayala et al. [8] also presented a template-free approach to predict reac-\ntion outcomes. Our approach differs from theirs in several ways. First, Kayala et al. operates at the\nmechanistic level - identifying elementary mechanistic steps rather than the overall transformations\nfrom reactants to products. Since most reactions consist of many mechanistic steps, their approach\n\n2\n\n\fFigure 2: Overview of our approach. (1) we train a model to identify pairwise atom interactions\nin the reaction center. (2) we pick the top K atom pairs and enumerate chemically-feasible bond\ncon\ufb01gurations between these atoms. Each bond con\ufb01guration generates a candidate outcome of the\nreaction. (3) Another model is trained to score these candidates to \ufb01nd the true product.\n\nrequires multiple predictions to ful\ufb01ll an entire reaction. Our approach operates at the graph level -\npredicting transformations from reactants to products in a single step. Second, mechanistic descrip-\ntions of reactions are not given in existing reaction databases. Therefore, Kayala et al. created their\ntraining set based on a mechanistic-level template-driven expert system. In contrast, our model is\nlearned directly from real-world experimental data. Third, Kayala et al. uses feed-forward neural net-\nworks where atoms and graphs are represented by molecular \ufb01ngerprints and additional hand-crafted\nfeatures. 
Our approach builds from graph neural networks to encode graph structures.\nMolecular Graph Neural Networks The question of molecular graph representation is a key issue\nin reaction modeling. In computational chemistry, molecules are often represented with Morgan\nFingerprints, boolean vectors that re\ufb02ect the presence of various substructures in a given molecule.\nDuvenaud et al. [5] developed a neural version of Morgan Fingerprints, where each convolution\noperation aggregates features of neighboring nodes as a replacement of the \ufb01xed hashing function.\nThis representation was further expanded by Kearnes et al. [9] into graph convolution models. Dai\net al. [4] consider a different architecture where a molecular graph is viewed as a latent variable\ngraphical model. Their recurrent model is derived from Belief Propagation-like algorithms. Gilmer\net al. [6] generalized all previous architectures into message-passing network, and applied them to\nquantum chemistry. The closest to our work is the Weisfeiler-Lehman Kernel Network proposed\nby Lei et al. [12]. This recurrent model is derived from the Weisfeiler-Lehman kernel that produces\nisomorphism-invariant representations of molecular graphs. In this paper, we further enhance this\nrepresentation to capture graph transformations for reaction prediction.\n\n3 Overview\n\nOur approach bypasses reaction templates by learning a reaction center identi\ufb01er. Speci\ufb01cally, we\ntrain a neural network that operates on the reactant graph to predict a reactivity score for every\npair of atoms (Section 3.1). A reaction center is then selected by picking a small number of atom\npairs with the highest reactivity scores. After identifying the reaction center, we generate possible\nproduct candidates by enumerating possible bond con\ufb01gurations between atoms in the reaction center\n(Section 3.2) subject to chemical constraints. 
We train another neural network to rank these product\ncandidates (represented as graphs, together with the reactants) so that the correct reaction outcome is\nranked highest (Section 3.3). The overall pipeline is summarized in Figure 2. Before describing the\ntwo modules in detail, we formally de\ufb01ne some key concepts used throughout the paper.\nChemical Reaction A chemical reaction is a pair of molecular graphs (Gr, Gp), where Gr is\ncalled the reactants and Gp the products. A molecular graph is described as G = (V, E), where\nV = {a1, a2,\u00b7\u00b7\u00b7 , an} is the set of atoms and E = {b1, b2,\u00b7\u00b7\u00b7 , bm} is the set of associated bonds of\nvarying types (single, double, aromatic, etc.). Note that Gr has multiple connected components\nsince there are multiple molecules comprising the reactants. The reactions used for training are\natom-mapped so that each atom in the product graph has a unique corresponding atom in the reactants.\nReaction Center A reaction center is a set of atom pairs {(ai, aj)}, where the bond type between ai\nand aj differs from Gr to Gp. In other words, a reaction center is a minimal set of graph edits needed\nto transform reactants to products. Since the reported reactions in the training set are atom-mapped,\nreaction centers can be identi\ufb01ed automatically given the product.\n\n3.1 Reaction Center Identi\ufb01cation\nIn a given reaction R = (Gr, Gp), each atom pair (au, av) in Gr is associated with a reactivity label\nyuv \u2208 {0, 1} specifying whether their relation differs between reactants and products. The label is\ndetermined by comparing Gr and Gp with the help of atom-mapping. We predict the label on the\nbasis of learned atom representations that incorporate contextual cues from the surrounding chemical\nenvironment. 
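Concretely, the reactivity labels can be derived by comparing atom-mapped bond maps. The sketch below is illustrative only, not the paper's code: the dict-based bond maps (`reactant_bonds`, `product_bonds`) keyed by atom-map-number pairs and the bond-type strings are hypothetical stand-ins for a real cheminformatics toolkit's graph objects.

```python
# Sketch: derive the reaction center (pairs with y_uv = 1) by comparing
# atom-mapped reactant and product bond maps. The representation
# {frozenset({u, v}): bond_type} is a hypothetical stand-in.

def reactivity_labels(reactant_bonds, product_bonds):
    """Return atom pairs whose bond type differs between Gr and Gp."""
    changed = set()
    for pair in set(reactant_bonds) | set(product_bonds):
        # A pair missing from one side (.get returns None) counts as changed.
        if reactant_bonds.get(pair) != product_bonds.get(pair):
            changed.add(pair)
    return changed

# Figure 1's example: bond (27,28) is deleted; (7,27) and (8,27) become aromatic.
r_bonds = {frozenset({27, 28}): "single"}
p_bonds = {frozenset({7, 27}): "aromatic", frozenset({8, 27}): "aromatic"}
center = reactivity_labels(r_bonds, p_bonds)  # three changed pairs
```

Every other atom pair keeps its (possibly absent) bond and receives label y_uv = 0.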
In particular, we build on a Weisfeiler-Lehman Network (WLN) that has shown superior\nresults against other learned graph representations in the narrower setting of predicting chemical\nproperties of individual molecules [12].\n\n3.1.1 Weisfeiler-Lehman Network (WLN)\nThe WLN is inspired by the Weisfeiler-Lehman isomorphism test for labeled graphs. The architecture\nis designed to embed the computations inherent in WL isomorphism testing to generate learned\nisomorphism-invariant representations for atoms.\nWL Isomorphism Test The key idea of the isomorphism test is to repeatedly augment node labels\nby the sorted set of node labels of neighbor nodes and to compress these augmented labels into new,\nshort labels. The initial labeling is the atom element. In each iteration, its label is augmented with the\nelement labels of its neighbors. Such a multi-set label is compactly represented as a new label by a\nhash function. Let c(L)_v be the \ufb01nal label of atom av. The molecular graph G = (V, E) is represented\nas a set {(c(L)_u, buv, c(L)_v) | (u, v) \u2208 E}, where buv is the bond type between u and v. Two graphs are\nsaid to be isomorphic if their set representations are the same. The number of distinct labels grows\nexponentially with the number of iterations L.\nWL Network The discrete relabeling process does not directly generalize to continuous feature\nvectors. Instead, we appeal to neural networks to continuously embed the computations inherent\nin the WL test. Let r be the analogous continuous relabeling function. Then a node v \u2208 G with\nneighbor nodes N(v), node features fv, and edge features fuv is \u201crelabeled\u201d according to\n\nr(v) = \u03c4(U1 fv + U2 \u03a3_{u \u2208 N(v)} \u03c4(V[fu, fuv]))    (1)\n\nwhere \u03c4(\u00b7) could be any non-linear function. 
We apply this relabeling operation iteratively to obtain\ncontext-dependent atom vectors\n\nh(l)_v = \u03c4(U1 h(l-1)_v + U2 \u03a3_{u \u2208 N(v)} \u03c4(V[h(l-1)_u, fuv]))    (1 \u2264 l \u2264 L)    (2)\n\nwhere h(0)_v = fv and U1, U2, V are shared across layers. The \ufb01nal atom representations arise from\nmimicking the set comparison function in the WL isomorphism test, yielding\n\ncv = \u03a3_{u \u2208 N(v)} W(0)h(L)_u \u2299 W(1)fuv \u2299 W(2)h(L)_v    (3)\n\nThe set comparison here is realized by matching each rank-1 edge tensor h(L)_u \u2297 fuv \u2297 h(L)_v to a set\nof reference edges also cast as rank-1 tensors W(0)[k] \u2297 W(1)[k] \u2297 W(2)[k], where W[k] is the\nk-th row of matrix W. In other words, Eq. 3 above could be written as\n\ncv[k] = \u03a3_{u \u2208 N(v)} \u27e8W(0)[k] \u2297 W(1)[k] \u2297 W(2)[k], h(L)_u \u2297 fuv \u2297 h(L)_v\u27e9    (4)\n\nThe resulting cv is a vector representation that captures the local chemical environment of the atom\n(through relabeling) and involves a comparison against a learned set of reference environments. The\nrepresentation of the whole graph G is simply the sum over all the atom representations: cG = \u03a3_v cv.\n\n3.1.2 Finding Reaction Centers with WLN\nWe present two models to predict reactivity: the local and global models. Our local model is based\ndirectly on the atom representations cu and cv in predicting label yuv. The global model, on the other\nhand, selectively incorporates distal chemical effects with the goal of capturing the fact that atoms\noutside of the reaction center may be necessary for the reaction to occur. For example, the reaction\ncenter may be in\ufb02uenced by certain reagents1. We incorporate these distal effects into the global\nmodel through an attention mechanism.\nLocal Model Let cu, cv be the atom representations for atoms u and v, respectively, as returned by\nthe WLN. 
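The iterative relabeling (Eq. 2) and set-comparison pooling (Eq. 3) can be made concrete with a minimal NumPy sketch. The toy chain graph, feature dimension, and random parameters below are assumptions for illustration only, not the paper's implementation; `tau` stands for the non-linearity τ.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = np.tanh  # any non-linearity tau(.)

# Toy molecule: 3 atoms in a chain 0-1-2; hypothetical feature size d.
n, d = 3, 4
f = rng.standard_normal((n, d))  # node features f_v
f_edge = {(0, 1): rng.standard_normal(d), (1, 2): rng.standard_normal(d)}
f_edge.update({(v, u): e for (u, v), e in f_edge.items()})  # symmetric f_uv
N = {0: [1], 1: [0, 2], 2: [1]}  # neighbor lists N(v)

U1, U2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
V = rng.standard_normal((d, 2 * d))
W0, W1, W2 = (rng.standard_normal((d, d)) for _ in range(3))

# Iterative relabeling (Eq. 2), starting from h^(0)_v = f_v.
h = f.copy()
for _ in range(3):  # unrolled depth L = 3, as in the experiments
    h = np.stack([
        tau(U1 @ h[v] + U2 @ sum(
            tau(V @ np.concatenate([h[u], f_edge[(u, v)]])) for u in N[v]))
        for v in range(n)])

# Set-comparison pooling (Eq. 3): element-wise products over incident edges.
c = np.stack([
    sum((W0 @ h[u]) * (W1 @ f_edge[(u, v)]) * (W2 @ h[v]) for u in N[v])
    for v in range(n)])
c_G = c.sum(axis=0)  # whole-graph representation c_G = sum_v c_v
```

Each `c[v]` plays the role of the atom representation c_v consumed by the reactivity models below.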
We predict the reactivity score of (u, v) by passing these through another neural network:\n\nsuv = \u03c3(uT \u03c4(Ma cu + Ma cv + Mb buv))    (5)\n\nwhere \u03c3(\u00b7) is the sigmoid function, and buv is an additional feature vector that encodes auxiliary\ninformation about the pair such as whether the two atoms are in different molecules or which type of\nbond connects them.\nGlobal Model Let \u03b1uv be the attention score of atom v on atom u. The global context representation\n\u02dccu of atom u is calculated as the weighted sum of all reactant atoms where the weight comes from\nthe attention module:\n\n\u02dccu = \u03a3_v \u03b1uv cv;    \u03b1uv = \u03c3(uT \u03c4(Pa cu + Pa cv + Pb buv))    (6)\n\nsuv = \u03c3(uT \u03c4(Ma \u02dccu + Ma \u02dccv + Mb buv))    (7)\n\nNote that the attention is obtained with sigmoid rather than softmax non-linearity since there may be\nmultiple atoms relevant to a particular atom u.\nTraining Both models are trained to minimize the following loss function:\n\nL(T) = \u2212\u03a3_{R \u2208 T} \u03a3_{u \u2260 v \u2208 R} [yuv log(suv) + (1 \u2212 yuv) log(1 \u2212 suv)]    (8)\n\nHere we predict each label independently because of the large number of variables. For a given\nreaction with N atoms, we need to predict the reactivity score of O(N^2) pairs. This quadratic\ncomplexity prohibits us from adding higher-order dependencies between different pairs. Nonetheless,\nwe found independent prediction yields suf\ufb01ciently good performance.\n\n3.2 Candidate Generation\n\nWe select the top K atom pairs with the highest predicted reactivity score and designate them,\ncollectively, as the reaction center. The set of candidate products are then obtained by enumerating all\npossible bond con\ufb01guration changes within the set. While the resulting set of candidate products is\nexponential in K, many can be ruled out by invoking additional constraints. For example, every atom\nhas a maximum number of neighbors it can connect to (valence constraint). 
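The sigmoid-attention scoring of the global model (Eqs. 5-7) can be sketched as follows. This is a minimal illustration under stated assumptions: the atom representations `c`, the dense pair-feature tensor `b`, the sizes, and all parameter matrices are random stand-ins, not the paper's trained weights.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
tau = np.tanh

n, d = 4, 8                          # toy sizes (hypothetical)
c = rng.standard_normal((n, d))      # WLN atom representations c_u
b = rng.standard_normal((n, n, d))   # pair features b_uv (stand-in)
u_vec = rng.standard_normal(d)
Pa, Pb = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Ma, Mb = rng.standard_normal((d, d)), rng.standard_normal((d, d))

# Sigmoid attention weights alpha_uv over all atom pairs (Eq. 6);
# sigmoid rather than softmax, so several atoms can be relevant at once.
alpha = np.array([[sigmoid(u_vec @ tau(Pa @ c[i] + Pa @ c[j] + Pb @ b[i, j]))
                   for j in range(n)] for i in range(n)])

# Global context: c~_u = sum_v alpha_uv c_v
c_tilde = alpha @ c

def score(i, j):
    """Reactivity score s_uv of pair (i, j) from global contexts (Eq. 7)."""
    return sigmoid(u_vec @ tau(Ma @ c_tilde[i] + Ma @ c_tilde[j] + Mb @ b[i, j]))
```

The local model is the same scoring head applied to `c` directly instead of `c_tilde`.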
We also leverage\nthe statistical bias that reaction centers are very unlikely to consist of disconnected components\n(connectivity constraint). Some multi-step reactions do exist that violate the connectivity constraint.\nAs we will show, the set of candidates arising from this procedure is more compact than those arising\nfrom templates, without sacri\ufb01cing coverage.\n\n3.3 Candidate Ranking\nThe training set for candidate ranking consists of lists T = {(r, p0, p1,\u00b7\u00b7\u00b7 , pm)}, where r are the\nreactants, p0 is the known product, and p1,\u00b7\u00b7\u00b7 , pm are other enumerated candidate products. The\ngoal is to learn a scoring function that ranks the known product p0 highest. The challenge in ranking\ncandidate products is again representational. We must learn to represent (r, p) in a manner that can\nfocus on the key difference between the reactants r and products p while also incorporating the\nnecessary chemical contexts surrounding the changes.\n\n1Molecules that do not typically contribute atoms to the product but are nevertheless necessary for the\nreaction to proceed.\n\nWe again propose two alternative models to score each candidate pair (r, p). The \ufb01rst model naively\nrepresents a reaction by summing difference vectors of all atom representations obtained from a\nWLN on the associated connected components. Our second and improved model, called WLDN,\ntakes into account higher order interactions between these difference vectors.\nWLN with Sum-Pooling Let c(pi)_v be the learned atom representation of atom v in candidate product\nmolecule pi. We de\ufb01ne the difference vector d(pi)_v pertaining to atom v as follows:\n\nd(pi)_v = c(pi)_v \u2212 c(r)_v;    s(pi) = uT \u03c4(M \u03a3_{v \u2208 pi} d(pi)_v)    (9)\n\nRecall that the reactants and products are atom-mapped so we can use v to refer to the same atom.\nThe pooling operation is a simple sum over these difference vectors, resulting in a single vector for\neach (r, pi) pair. This vector is then fed into another neural network to score the candidate product pi.\nWeisfeiler-Lehman Difference Network (WLDN) Instead of simply summing all difference vec-\ntors, the WLDN operates on another graph called a difference graph. A difference graph D(r, pi) is\nde\ufb01ned as a molecular graph which has the same atoms and bonds as pi, with atom v\u2019s feature vector\nreplaced by d(pi)_v. Operating on the difference graph has several bene\ufb01ts. First, in D(r, pi), atom v\u2019s\nfeature vector deviates from zero only if it is close to the reaction center, thus focusing the processing\non the reaction center and its immediate context. Second, D(r, pi) explicates neighbor dependencies\nbetween difference vectors. The WLDN maps this graph-based representation into a \ufb01xed-length\nvector, by applying a separately parameterized WLN on top of D(r, pi):\n\nh(pi,l)_v = \u03c4(U1 h(pi,l-1)_v + U2 \u03a3_{u \u2208 N(v)} \u03c4(V[h(pi,l-1)_u, fuv]))    (1 \u2264 l \u2264 L)    (10)\n\nd(pi,L)_v = \u03a3_{u \u2208 N(v)} W(0)h(pi,L)_u \u2299 W(1)fuv \u2299 W(2)h(pi,L)_v    (11)\n\nwhere h(pi,0)_v = d(pi)_v. The \ufb01nal score of pi is s(pi) = uT \u03c4(M \u03a3_{v \u2208 pi} d(pi,L)_v).\nTraining Both models are trained to minimize the softmax log-likelihood objective over the scores\n{s(p0), s(p1),\u00b7\u00b7\u00b7 , s(pm)} where s(p0) corresponds to the target.\n\n4 Experiments\n\nData As a source of data for our experiments, we used reactions from USPTO granted patents,\ncollected by Lowe [13]. After removing duplicates and erroneous reactions, we obtained a set of\n480K reactions, to which we refer in the paper as USPTO. This dataset is divided into 400K, 40K,\nand 40K for training, development, and testing purposes.2\nIn addition, for comparison purposes we report the results on the subset of 15K reactions from this\ndataset (referred to as USPTO-15K) used by Coley et al. [3]. They selected this subset to include\nreactions covered by the 1.7K most common templates. We follow their split, with 10.5K, 1.5K, and\n3K for training, development, and testing.\nSetup for Reaction Center Identi\ufb01cation The output of this component consists of K atom pairs\nwith the highest reactivity scores. We compute the coverage as the proportion of reactions where all\natom pairs in the true reaction center are predicted by the model, i.e., where the recorded product is\nfound in the model-generated candidate set.\nThe model features re\ufb02ect basic chemical properties of atoms and bonds. Atom-level features include\nits elemental identity, degree of connectivity, number of attached hydrogen atoms, implicit valence,\nand aromaticity. Bond-level features include bond type (single, double, triple, or aromatic), whether\nit is conjugated, and whether the bond is part of a ring.\nBoth our local and global models are built upon a Weisfeiler-Lehman Network, with unrolled depth\n3. 
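The difference-vector construction (Eq. 9) and the WLDN scoring over the difference graph (Eqs. 10-11) can be sketched as follows. The toy chain graph, dimensions, and random parameters are illustrative assumptions; the softmax training objective is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
tau = np.tanh

n, d = 3, 4
c_r = rng.standard_normal((n, d))  # reactant atom representations c^(r)_v
c_p = rng.standard_normal((n, d))  # candidate product representations c^(pi)_v
u_vec, M = rng.standard_normal(d), rng.standard_normal((d, d))

# Difference vectors (Eq. 9); atom-mapping lets row v mean the same atom.
diff = c_p - c_r

# Sum-pooling baseline: s(pi) = u^T tau(M sum_v d^(pi)_v)
s_wln = u_vec @ tau(M @ diff.sum(axis=0))

# WLDN: a separately parameterized WLN over the difference graph D(r, pi),
# whose node features are the difference vectors (toy chain graph 0-1-2).
N = {0: [1], 1: [0, 2], 2: [1]}
f_edge = {k: rng.standard_normal(d) for k in [(0, 1), (1, 0), (1, 2), (2, 1)]}
U1, U2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
V = rng.standard_normal((d, 2 * d))
W0, W1, W2 = (rng.standard_normal((d, d)) for _ in range(3))

h = diff.copy()                     # h^(pi,0)_v = d^(pi)_v
for _ in range(2):                  # L relabeling iterations (Eq. 10)
    h = np.stack([
        tau(U1 @ h[v] + U2 @ sum(
            tau(V @ np.concatenate([h[u], f_edge[(u, v)]])) for u in N[v]))
        for v in range(n)])
d_L = np.stack([                    # pooled difference features (Eq. 11)
    sum((W0 @ h[u]) * (W1 @ f_edge[(u, v)]) * (W2 @ h[v]) for u in N[v])
    for v in range(n)])
s_wldn = u_vec @ tau(M @ d_L.sum(axis=0))  # final WLDN score of pi
```

In training, such scores for {p0, p1, ..., pm} would be fed to a softmax cross-entropy loss with p0 as the target.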
All models are optimized with Adam [10], with learning rate decay factor 0.9.\n\n2Code and data available at https://github.com/wengong-jin/nips17-rexgen\n\n(a) Reaction Center Prediction Performance. Coverage is reported by picking the top K (K=6,8,10)\nreactivity pairs. |\u03b8| is the number of model parameters.\n\nUSPTO-15K\nMethod  |\u03b8|    K=6   K=8   K=10\nLocal   572K   80.1  85.0  87.7\nLocal   1003K  81.6  86.1  89.1\nGlobal  756K   86.7  90.1  92.2\n\nUSPTO\nMethod  |\u03b8|    K=6   K=8   K=10\nLocal   572K   83.0  87.2  89.6\nGlobal  756K   89.8  92.0  93.3\n\nAvg. Num. of Candidates (USPTO)\nTemplate: 482.3 (out of 5006 templates)\nGlobal:   60.9 (K=6), 246.5 (K=8), 1076 (K=10)\n\n(b) Candidate Ranking Performance. Precision at ranks 1, 3, 5 is reported. (*) denotes that the true\nproduct was added if not covered by the previous stage.\n\nUSPTO-15K\nMethod        Cov.   P@1   P@3   P@5\nColey et al.  100.0  72.1  86.6  90.7\nWLN           90.1   74.9  84.6  86.3\nWLDN          90.1   76.7  85.6  86.8\nWLN (*)       100.0  81.4  92.5  94.8\nWLDN (*)      100.0  84.1  94.1  96.1\n\nUSPTO\nMethod    |\u03b8|   P@1   P@3   P@5\nWLDN      3.2M  79.6  87.7  89.2\nWLDN (*)  3.2M  83.9  93.2  95.2\n\nTable 1: Model Comparison on USPTO-15K and USPTO dataset.\n\nSetup for Candidate Ranking The goal of this evaluation is to determine whether the model can\nselect the correct product from a set of candidates derived from the reaction center. We \ufb01rst compare\nmodel accuracy against the top-performing template-based approach by Coley et al. [3]. This\napproach employs frequency-based heuristics to construct reaction templates and then uses a neural\nmodel to rank the derived candidates. As explained above, due to the scalability issues associated\nwith this baseline, we can only compare on USPTO-15K, which the authors restricted to contain only\nexamples that were instantiated by their most popular templates. For this experiment, we set K = 8\nfor candidate generation, which achieves 90% coverage and yields 250 candidates per reaction. 
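The coverage metric used in this setup can be sketched as follows; the dictionary of pairwise scores and the toy reactions are hypothetical stand-ins for the model's s_uv outputs and the atom-mapped ground truth.

```python
# Sketch of the coverage metric: a reaction counts as covered when every
# atom pair in its true reaction center appears among the top-K scored pairs.

def top_k_pairs(scores, k):
    """scores: {(u, v): reactivity score}; return the k highest-scoring pairs."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def coverage(reactions, k):
    """reactions: iterable of (scores, true_center) tuples."""
    covered = sum(true_center <= top_k_pairs(scores, k)   # subset test
                  for scores, true_center in reactions)
    return covered / len(reactions)

# Two toy reactions: the first is covered at K=2, the second is not.
reactions = [
    ({(1, 2): 0.9, (2, 3): 0.8, (3, 4): 0.1}, {(1, 2), (2, 3)}),
    ({(1, 2): 0.9, (2, 3): 0.2, (3, 4): 0.8}, {(2, 3)}),
]
print(coverage(reactions, k=2))  # prints 0.5
```

Raising K can only grow the top-K set, so coverage is monotonically non-decreasing in K, matching the trend in Table 1a.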
To\ncompare a standard WLN representation against its counterpart with Difference Networks (WLDN),\nwe train them under the same setup on USPTO-15K, \ufb01xing the number of parameters to 650K.\nNext, we evaluate our model on USPTO for large scale evaluation. We set K = 6 for candidate\ngeneration and report the result of the best model architecture. Finally, to factorize the coverage of\ncandidate selection and the accuracy of candidate ranking, we consider two evaluation scenarios: (1)\nthe candidate list as derived from reaction center; (2) the above candidate list augmented with the\ntrue product if not found. This latter setup is marked with (*).\n\n4.1 Results\n\nReaction Center Identi\ufb01cation Table 1a reports the coverage of the model as compared to the real\nreaction core. Clearly, the coverage depends on the number of atom pairs K, with the higher coverage\nfor larger values of K. These results demonstrate that even for K = 8, the model achieves high\ncoverage, above 90%.\nThe results also clearly demonstrate the advantage of the global model over the local one, which is\nconsistent across all experiments. The superiority of the global model is in line with the well-known\nfact that reactivity depends on more than the immediate local environment surrounding the reaction\ncenter. The presence of certain functional groups (structural motifs that appear frequently in organic\nchemistry) far from the reaction center can promote or inhibit different modes of reactivity. Moreover,\nreactivity is often in\ufb02uenced by the presence of reagents, which are separate molecules that may not\ndirectly contribute atoms to the product. Consideration of both of these factors necessitates the use of\na model that can account for long-range dependencies between atoms.\nFigure 3 depicts one such example, where the observed reactivity can be attributed to the presence of\na reagent molecule that is completely disconnected from the reaction center itself. 
While the local\nmodel fails to anticipate this reactivity, the global one accurately predicts the reaction center. The\nattention map highlights the reagent molecule as the determinant context.\n\nFigure 3: A reaction that reduces the carbonyl carbon of an amide by removing bond 4-23 (red circle).\nReactivity at this site would be highly unlikely without the presence of borohydride (atom 25, blue\ncircle). The global model correctly predicts bond 4-23 as the most susceptible to change, while the\nlocal model does not even include it in the top ten predictions. The attention map of the global model\nshows that atoms 1, 25, and 26 were determinants of atom 4\u2019s predicted reactivity.\n\n(a) An example where the reaction occurs at the \u03b1 carbon\n(atom 7, red circle) of a carbonyl group (bond 8-13),\nalso adjacent to a phenyl group (atoms 1-6). The corre-\nsponding template explicitly requires both the carbonyl\nand part of the phenyl ring as context (atoms 4, 7, 8,\n13), although reactivity in this case does not require the\nadditional speci\ufb01cation of the phenyl group (atom 1).\n\n(b) Performance of reactions with different popularity.\nMRR stands for mean reciprocal rank.\n\nFigure 4\n\nCandidate Generation Here we compare the coverage of the generated candidates with the template-\nbased model. Table 1a shows that for K = 6, our model generates an average of 60.1 candidates\nand reaches a coverage of 89.8%. The template-based baseline requires 5006 templates extracted\nfrom the training data (corresponding to a minimum of \ufb01ve precedent reactions) to achieve 90.1%\ncoverage with an average of 482 candidates per example.\nThis weakness of the baseline model can be explained by the dif\ufb01culty in de\ufb01ning general heuristics\nwith which to extract templates from reaction examples. 
It is possible to de\ufb01ne different levels\nof speci\ufb01city based on the extent to which atoms surrounding the reaction center are included or\ngeneralized [11]. This introduces an unavoidable trade-off between generality (fewer templates,\nhigher coverage, more candidates) and speci\ufb01city (more templates, less coverage, fewer candidates).\nFigure 4a illustrates one reaction example where the corresponding template is rare due to the\nadjacency of the reaction center to both a carbonyl group and a phenyl ring. Because adjacency to\neither group can in\ufb02uence reactivity, both are included as part of the template, although reactivity in\nthis case does not require the additional speci\ufb01cation of the phenyl group.\nThe massive number of templates required for high coverage is a serious impediment for the template\napproach because each template application requires solving a subgraph isomorphism problem.\nSpeci\ufb01cally, it takes on average 7 seconds to apply the 5006 templates to a test instance, while our\nmethod takes less than 50 ms, about 140 times faster.\nCandidate Ranking Table 1b reports the performance on the product prediction task. Since the\nbaseline templates from [3] were optimized on the test and have 100% coverage, we compare its\nperformance against our models to which the correct product is added (WLN(*) and WLDN(*)).\nOur model clearly outperforms the baseline by a wide margin. Even when compared against the\ncandidates automatically computed from the reaction center, WLDN outperforms the baseline in\n\n8\n\n\ftop-1 accuracy. The results also demonstrate that the WLDN model consistently outperforms the\nWLN model. This is consistent with our intuition that modeling higher order dependencies between\nthe difference vectors is advantageous over simply summing over them. 
Table 1b also shows the\nmodel performance improves when tested on the full USPTO dataset.\nWe further analyze model performance based on the frequency of the underlying transformation\nas re\ufb02ected by the number of template precedents. In Figure 4b we group the test instances\naccording to their frequency and report the coverage of the global model and the mean reciprocal\nrank (MRR) of the WLDN model on each of them. As expected, our approach achieves the highest\nperformance for frequent reactions. However, it maintains reasonable coverage and ranking accuracy\neven for rare reactions, which are particularly challenging for template-based methods.\n\n4.2 Human Evaluation Study\n\nWe randomly selected 80 reaction examples from the test set, ten from each of the template popularity\nintervals of Figure 4b, and asked ten chemists to predict the outcome of each given its reactants. The\naverage accuracy across the ten performers was 48.2%. Our model achieves an accuracy of 69.1%,\nvery close to the best individual performer who scored 72.0%.\n\nChemist:    56.3  50.0  72.0  63.8  66.3  65.0  40.0  58.8  25.0  16.3\nOur Model:  69.1\n\nTable 2: Human and model performance on 80 reactions randomly selected from the USPTO test set\nto cover a diverse range of reaction types. The \ufb01rst 8 are chemists with rich experience in organic\nchemistry (graduate, postdoctoral and professor level chemists) and the last two are graduate students\nin chemical engineering who use organic chemistry concepts regularly but have less formal training.\n\n5 Conclusion\n\nWe proposed a novel template-free approach for chemical reaction prediction. Instead of generating\ncandidate products by reaction templates, we \ufb01rst predict a small set of atoms/bonds in the reaction\ncenter, and then produce candidate products by enumerating all possible bond con\ufb01guration changes\nwithin the set. 
Compared to the template-based approach, our framework runs 140 times faster, allowing us to scale to much larger reaction databases. Both our reaction center identifier and our candidate ranking model are built on the Weisfeiler-Lehman Network and its variants, which learn compact representations of graphs and reactions. We hope our work will encourage both computer scientists and chemists to explore fully data-driven approaches to this task.

Acknowledgement

We thank Tim Jamison, Darsh Shah, Karthik Narasimhan and the reviewers for their helpful comments. We also thank members of the MIT Department of Chemistry and Department of Chemical Engineering who participated in the human benchmarking study. This work was supported by the DARPA Make-It program under contract ARO W911NF-16-2-0023.

References

[1] Jonathan H. Chen and Pierre Baldi. No electron left behind: a rule-based expert system to predict chemical reactions and reaction mechanisms. Journal of Chemical Information and Modeling, 49(9):2034–2043, 2009.

[2] Clara D. Christ, Matthias Zentgraf, and Jan M. Kriegl. Mining electronic laboratory notebooks: analysis, retrosynthesis, and reaction based enumeration. Journal of Chemical Information and Modeling, 52(7):1745–1756, 2012.

[3] Connor W. Coley, Regina Barzilay, Tommi S. Jaakkola, William H. Green, and Klavs F. Jensen. Prediction of organic reaction outcomes using machine learning. ACS Central Science, 2017.

[4] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. arXiv preprint arXiv:1603.05629, 2016.

[5] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints.
In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[6] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.

[7] Markus Hartenfeller, Martin Eberle, Peter Meier, Cristina Nieto-Oberhuber, Karl-Heinz Altmann, Gisbert Schneider, Edgar Jacoby, and Steffen Renner. A collection of robust organic synthesis reactions for in silico molecule design. Journal of Chemical Information and Modeling, 51(12):3093–3098, 2011.

[8] Matthew A. Kayala, Chloé-Agathe Azencott, Jonathan H. Chen, and Pierre Baldi. Learning to predict chemical reactions. Journal of Chemical Information and Modeling, 51(9):2209–2222, 2011.

[9] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.

[10] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[11] James Law, Zsolt Zsoldos, Aniko Simon, Darryl Reid, Yang Liu, Sing Yoong Khew, A. Peter Johnson, Sarah Major, Robert A. Wade, and Howard Y. Ando. Route designer: a retrosynthetic analysis tool utilizing automated retrosynthetic rule generation. Journal of Chemical Information and Modeling, 49(3):593–602, 2009. ISSN 1549-9596.

[12] Tao Lei, Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Deriving neural architectures from sequence and graph kernels. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[13] D. M. Lowe. Patent reaction extraction: downloads; https://bitbucket.org/dan2097/patent-reaction-extraction/downloads. 2014.

[14] Marwin H. S. Segler and Mark P. Waller. Neural-symbolic machine learning for retrosynthesis and reaction prediction.
Chemistry - A European Journal, 2017.

[15] Sara Szymkuc, Ewa P. Gajewska, Tomasz Klucznik, Karol Molga, Piotr Dittwald, Michał Startek, Michał Bajczyk, and Bartosz A. Grzybowski. Computer-assisted synthetic planning: The end of the beginning. Angewandte Chemie International Edition, 55(20):5904–5937, 2016. ISSN 1521-3773. doi: 10.1002/anie.201506101. URL http://dx.doi.org/10.1002/anie.201506101.

[16] Matthew H. Todd. Computer-aided organic synthesis. Chemical Society Reviews, 34(3):247–266, 2005.

[17] Wendy A. Warr. A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility. Molecular Informatics, 33(6-7):469–476, 2014.

[18] Jennifer N. Wei, David Duvenaud, and Alán Aspuru-Guzik. Neural networks for the prediction of organic chemistry reactions. ACS Central Science, 2(10):725–732, 2016.