{"title": "Convolutional Networks on Graphs for Learning Molecular Fingerprints", "book": "Advances in Neural Information Processing Systems", "page_first": 2224, "page_last": 2232, "abstract": "We introduce a convolutional neural network that operates directly on graphs.These networks allow end-to-end learning of prediction pipelines whose inputs are graphs of arbitrary size and shape.The architecture we present generalizes standard molecular feature extraction methods based on circular fingerprints.We show that these data-driven features are more interpretable, and have better predictive performance on a variety of tasks.", "full_text": "Convolutional Networks on Graphs\nfor Learning Molecular Fingerprints\n\nDavid Duvenaud\u2020, Dougal Maclaurin\u2020, Jorge Aguilera-Iparraguirre\n\nRafael G\u00b4omez-Bombarelli, Timothy Hirzel, Al\u00b4an Aspuru-Guzik, Ryan P. Adams\n\nHarvard University\n\nAbstract\n\nWe introduce a convolutional neural network that operates directly on graphs.\nThese networks allow end-to-end learning of prediction pipelines whose inputs\nare graphs of arbitrary size and shape. The architecture we present generalizes\nstandard molecular feature extraction methods based on circular \ufb01ngerprints. We\nshow that these data-driven features are more interpretable, and have better pre-\ndictive performance on a variety of tasks.\n\n1\n\nIntroduction\n\nRecent work in materials design used neural networks to predict the properties of novel molecules\nby generalizing from examples. One dif\ufb01culty with this task is that the input to the predictor, a\nmolecule, can be of arbitrary size and shape. Currently, most machine learning pipelines can only\nhandle inputs of a \ufb01xed size. The current state of the art is to use off-the-shelf \ufb01ngerprint software\nto compute \ufb01xed-dimensional feature vectors, and use those features as inputs to a fully-connected\ndeep neural network or other standard machine learning method. 
This formula was followed by [28, 3, 19]. During training, the molecular fingerprint vectors were treated as fixed.\n\nIn this paper, we replace the bottom layer of this stack – the function that computes molecular fingerprint vectors – with a differentiable neural network whose input is a graph representing the original molecule. In this graph, vertices represent individual atoms and edges represent bonds. The lower layers of this network are convolutional, in the sense that the same local filter is applied to each atom and its neighborhood. After several such layers, a global pooling step combines features from all the atoms in the molecule.\n\nThese neural graph fingerprints offer several advantages over fixed fingerprints:\n\n• Predictive performance. By adapting to the task at hand, machine-optimized fingerprints can provide substantially better predictive performance than fixed fingerprints. We show that neural graph fingerprints match or beat the predictive performance of standard fingerprints on solubility, drug efficacy, and organic photovoltaic efficiency datasets.\n• Parsimony. Fixed fingerprints must be extremely large to encode all possible substructures without overlap. For example, [28] used a fingerprint vector of size 43,000, after having removed rarely-occurring features. Differentiable fingerprints can be optimized to encode only relevant features, reducing downstream computation and regularization requirements.\n• Interpretability. Standard fingerprints encode each possible fragment completely distinctly, with no notion of similarity between fragments. 
In contrast, each feature of a neural graph fingerprint can be activated by similar but distinct molecular fragments, making the feature representation more meaningful.\n\n†Equal contribution.\n\nFigure 1: Left: A visual representation of the computational graph of both standard circular fingerprints and neural graph fingerprints. First, a graph is constructed matching the topology of the molecule being fingerprinted, in which nodes represent atoms and edges represent bonds. At each layer, information flows between neighbors in the graph. Finally, each node in the graph turns on one bit in the fixed-length fingerprint vector. Right: A more detailed sketch including the bond information used in each operation.\n\n2 Circular fingerprints\n\nThe state of the art in molecular fingerprints is extended-connectivity circular fingerprints (ECFP) [21]. Circular fingerprints [6] are a refinement of the Morgan algorithm [17], designed to encode which substructures are present in a molecule in a way that is invariant to atom-relabeling. Circular fingerprints generate each layer's features by applying a fixed hash function to the concatenated features of the neighborhood in the previous layer. The results of these hashes are then treated as integer indices, where a 1 is written to the fingerprint vector at the index given by the feature vector at each node in the graph. Figure 1 (left) shows a sketch of this computational architecture. Ignoring collisions, each index of the fingerprint denotes the presence of a particular substructure. The size of the substructures represented by each index depends on the depth of the network. 
Thus the number of layers is referred to as the 'radius' of the fingerprints.\n\nCircular fingerprints are analogous to convolutional networks in that they apply the same operation locally everywhere, and combine information in a global pooling step.\n\n3 Creating a differentiable fingerprint\n\nThe space of possible network architectures is large. In the spirit of starting from a known-good configuration, we designed a differentiable generalization of circular fingerprints. This section describes our replacement of each discrete operation in circular fingerprints with a differentiable analog.\n\nHashing. The purpose of the hash functions applied at each layer of circular fingerprints is to combine information about each atom and its neighboring substructures. This ensures that any change in a fragment, no matter how small, will lead to a different fingerprint index being activated. We replace the hash operation with a single layer of a neural network. Using a smooth function allows the activations to be similar when the local molecular structure varies in unimportant ways.\n\nIndexing. Circular fingerprints use an indexing operation to combine all the nodes' feature vectors into a single fingerprint of the whole molecule. Each node sets a single bit of the fingerprint to one, at an index determined by the hash of its feature vector. This pooling-like operation converts an arbitrary-sized graph into a fixed-sized vector. For small molecules and a large fingerprint length, the fingerprints are always sparse. We use the softmax operation as a differentiable analog of indexing: in essence, each atom is asked to classify itself as belonging to a single category. The sum of all these classification label vectors produces the final fingerprint. 
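As an illustrative sketch (hypothetical NumPy code and shapes, not the paper's released implementation), the softmax-based indexing can be written as:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def pool_fingerprint(atom_activations, W_out):
    """Differentiable indexing: each atom softly 'classifies' itself into
    one of S fingerprint categories, and the class vectors are summed."""
    fp = np.zeros(W_out.shape[1])
    for r_a in atom_activations:       # one feature vector per atom
        fp += softmax(r_a @ W_out)     # soft one-hot index for this atom
    return fp

rng = np.random.default_rng(0)
atoms = rng.normal(size=(6, 4))        # toy sizes: 6 atoms, F = 4 features
W = rng.normal(size=(4, 16))           # output weights, S = 16 fingerprint bits
fp = pool_fingerprint(atoms, W)
print(fp.shape)                        # (16,); each softmax sums to 1, so fp sums to 6
```

Because every per-atom softmax sums to one, the resulting fingerprint entries sum to the number of atoms, the smooth counterpart of writing one bit per atom.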
This operation is analogous to the pooling operation in standard convolutional neural networks.\n\nAlgorithm 1: Circular fingerprints\n1: Input: molecule, radius R, fingerprint length S\n2: Initialize: fingerprint vector f ← 0_S\n3: for each atom a in molecule\n4:   r_a ← g(a)  ▷ lookup atom features\n5: for L = 1 to R  ▷ for each layer\n6:   for each atom a in molecule\n7:     r_1 ... r_N = neighbors(a)\n8:     v ← [r_a, r_1, ..., r_N]  ▷ concatenate\n9:     r_a ← hash(v)  ▷ hash function\n10:    i ← mod(r_a, S)  ▷ convert to index\n11:    f_i ← 1  ▷ write 1 at index\n12: Return: binary vector f\n\nAlgorithm 2: Neural graph fingerprints\n1: Input: molecule, radius R, hidden weights H_1^1 ... H_R^5, output weights W_1 ... W_R\n2: Initialize: fingerprint vector f ← 0_S\n3: for each atom a in molecule\n4:   r_a ← g(a)  ▷ lookup atom features\n5: for L = 1 to R  ▷ for each layer\n6:   for each atom a in molecule\n7:     r_1 ... r_N = neighbors(a)\n8:     v ← r_a + Σ_{i=1}^N r_i  ▷ sum\n9:     r_a ← σ(v H_L^N)  ▷ smooth function\n10:    i ← softmax(r_a W_L)  ▷ sparsify\n11:    f ← f + i  ▷ add to fingerprint\n12: Return: real-valued vector f\n\nFigure 2: Pseudocode of circular fingerprints (left) and neural graph fingerprints (right). Differences are highlighted in blue. Every non-differentiable operation is replaced with a differentiable analog.\n\nCanonicalization. Circular fingerprints are identical regardless of the ordering of atoms in each neighborhood. This invariance is achieved by sorting the neighboring atoms according to their features and bond features. We experimented with this sorting scheme, and also with applying the local feature transform on all possible permutations of the local neighborhood. 
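A minimal NumPy sketch of the forward pass of Algorithm 2, with hypothetical shapes and a per-degree lookup for the hidden weight matrices (an illustration under these assumptions, not the released implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def neural_fingerprint(features, neighbors, H, W, R):
    """features: (num_atoms, F) initial atom features.
    neighbors: list of neighbor-index lists, one per atom.
    H[L][N]: (F, F) hidden weights for layer L and degree N.
    W[L]: (F, S) output weights for layer L."""
    r = features.copy()
    f = np.zeros(W[0].shape[1])
    for L in range(R):
        new_r = np.empty_like(r)
        for a, nbrs in enumerate(neighbors):
            v = r[a] + sum(r[b] for b in nbrs)       # permutation-invariant sum
            new_r[a] = np.tanh(v @ H[L][len(nbrs)])  # smooth analog of the hash
            f += softmax(new_r[a] @ W[L])            # soft indexing into f
        r = new_r
    return f

# Toy 3-atom chain 0-1-2, F = 4 features, S = 8 fingerprint bits, R = 2 layers.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
nbrs = [[1], [0, 2], [1]]
H = [{1: rng.normal(size=(4, 4)), 2: rng.normal(size=(4, 4))} for _ in range(2)]
W = [rng.normal(size=(4, 8)) for _ in range(2)]
fp = neural_fingerprint(feats, nbrs, H, W, R=2)
print(fp.shape)   # (8,)
```

Computing all updates from the previous layer's activations (`new_r`) keeps the atom updates simultaneous, matching the layer-by-layer structure of the pseudocode.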
An alternative to canonicalization is to apply a permutation-invariant function, such as summation. In the interests of simplicity and scalability, we chose summation.\n\nCircular fingerprints can be interpreted as a special case of neural graph fingerprints having large random weights. This is because, in the limit of large input weights, tanh nonlinearities approach step functions, which when concatenated form a simple hash function. Also, in the limit of large input weights, the softmax operator approaches a one-hot-coded argmax operator, which is analogous to an indexing operation.\n\nAlgorithms 1 and 2 summarize these two algorithms and highlight their differences. Given a fingerprint length L and F features at each layer, the parameters of neural graph fingerprints consist of a separate output weight matrix of size F × L for each layer, as well as a set of hidden-to-hidden weight matrices of size F × F at each layer, one for each possible number of bonds an atom can have (up to 5 in organic molecules).\n\n4 Experiments\n\nWe ran two experiments to demonstrate that neural fingerprints with large random weights behave similarly to circular fingerprints. First, we examined whether distances between circular fingerprints were similar to distances between neural fingerprints. Figure 3 (left) shows a scatterplot of pairwise distances between circular vs. neural fingerprints. Fingerprints had length 2048, and were calculated on pairs of molecules from the solubility dataset [4]. Distance was measured using a continuous generalization of the Tanimoto (a.k.a. Jaccard) similarity measure, given by\n\ndistance(x, y) = 1 − Σ_i min(x_i, y_i) / Σ_i max(x_i, y_i)    (1)\n\nThere is a correlation of r = 0.823 between the distances. 
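For concreteness, a small sketch (not taken from the paper's codebase) of the continuous Tanimoto distance in Equation 1:

```python
import numpy as np

def tanimoto_distance(x, y):
    """Continuous Tanimoto (Jaccard) distance between two non-negative
    fingerprint vectors, following Equation 1."""
    return 1.0 - np.minimum(x, y).sum() / np.maximum(x, y).sum()

a = np.array([1.0, 0.0, 2.0, 0.5])
b = np.array([1.0, 1.0, 1.0, 0.5])
print(tanimoto_distance(a, a))   # identical vectors -> 0.0
print(tanimoto_distance(a, b))
```

On binary vectors this reduces to the usual Jaccard distance, and two fingerprints with no overlapping nonzero entries are at distance exactly 1.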
The line of points on the right of the plot shows that for some pairs of molecules, binary ECFP fingerprints have exactly zero overlap.\n\nSecond, we examined the predictive performance of neural fingerprints with large random weights vs. that of circular fingerprints. Figure 3 (right) shows average predictive performance on the solubility dataset, using linear regression on top of the fingerprints. The performances of both methods follow similar curves. In contrast, the performance of neural fingerprints with small random weights follows a different curve, and is substantially better. This suggests that even with random weights, the relatively smooth activation of neural fingerprints helps generalization performance.\n\nFigure 3: Left: Comparison of pairwise distances between molecules, measured using circular fingerprints and neural graph fingerprints with large random weights. Right: Predictive performance of circular fingerprints (red), neural graph fingerprints with fixed large random weights (green), and neural graph fingerprints with fixed small random weights (blue). The performance of neural graph fingerprints with large random weights closely matches the performance of circular fingerprints.\n\n4.1 Examining learned features\n\nTo demonstrate that neural graph fingerprints are interpretable, we show substructures which most activate individual features in a fingerprint vector. Each feature of a circular fingerprint vector can only be activated by a single fragment of a single radius, except for accidental collisions. In contrast, neural graph fingerprint features can be activated by variations of the same structure, making them more interpretable and allowing shorter feature vectors.\n\nSolubility features. Figure 4 shows the fragments that maximally activate the most predictive features of a fingerprint. 
The fingerprint network was trained end-to-end as the input to a linear model predicting solubility, as measured in [4]. The feature shown in the top row has a positive predictive relationship with solubility, and is most activated by fragments containing a hydrophilic R-OH group, a standard indicator of solubility. The feature shown in the bottom row, strongly predictive of insolubility, is activated by non-polar repeated ring structures.\n\nFigure 4: Examining fingerprints optimized for predicting solubility. Shown here are representative examples of molecular fragments (highlighted in blue) which most activate different features of the fingerprint. Top row: fragments most activated by the pro-solubility feature, the feature most predictive of solubility. Bottom row: fragments most activated by the anti-solubility feature, the feature most predictive of insolubility.\n\nToxicity features. We trained the same model architecture to predict toxicity, as measured in two different datasets in [26]. Figure 5 shows fragments which maximally activate the feature most predictive of toxicity, in two separate datasets (SR-MMP and NR-AHR).\n\nFigure 5: Visualizing fingerprints optimized for predicting toxicity. Shown here are representative samples of molecular fragments (highlighted in red) which most activate the feature most predictive of toxicity. Top row: the most predictive feature identifies groups containing a sulphur atom attached to an aromatic ring. 
Bottom row: the most predictive feature identifies fused aromatic rings, also known as polycyclic aromatic hydrocarbons, a well-known class of carcinogens.\n\n[27] constructed similar visualizations, but in a semi-manual way: to determine which toxic fragments activated a given neuron, they searched over a hand-made list of toxic substructures and chose the one most correlated with a given neuron. In contrast, our visualizations are generated automatically, without the need to restrict the range of possible answers beforehand.\n\n4.2 Predictive Performance\n\nWe ran several experiments to compare the predictive performance of neural graph fingerprints to that of the standard state-of-the-art setup: circular fingerprints fed into a fully-connected neural network.\n\nExperimental setup. Our pipeline takes as input the SMILES [30] string encoding of each molecule, which is then converted into a graph using RDKit [20]. We also used RDKit to produce the extended circular fingerprints used in the baseline. Hydrogen atoms were treated implicitly. In our convolutional networks, the initial atom and bond features were chosen to be similar to those used by ECFP: initial atom features concatenated a one-hot encoding of the atom's element, its degree, the number of attached hydrogen atoms, the implicit valence, and an aromaticity indicator. The bond features were a concatenation of whether the bond type was single, double, triple, or aromatic, whether the bond was conjugated, and whether the bond was part of a ring.\n\nTraining and architecture. Training used batch normalization [11]. We also experimented with tanh vs. relu activation functions for both the neural fingerprint network layers and the fully-connected network layers; relu had a slight but consistent performance advantage on the validation set. 
We also experimented with dropconnect [29], a variant of dropout in which weights are randomly set to zero instead of hidden units, but found that it led to worse validation error in general. Each experiment optimized for 10000 minibatches of size 100 using the Adam algorithm [13], a variant of RMSprop that includes momentum.\n\nHyperparameter optimization. To optimize hyperparameters, we used random search. The hyperparameters of all methods were optimized using 50 trials for each cross-validation fold. The following hyperparameters were optimized: log learning rate, log of the initial weight scale, the log L2 penalty, fingerprint length, fingerprint depth (up to 6), and the size of the hidden layer in the fully-connected network. Additionally, the size of the hidden feature vector in the convolutional neural fingerprint networks was optimized.\n\n| Dataset | Solubility [4] | Drug efficacy [5] | Photovoltaic efficiency [8] |\n| Units | log Mol/L | EC50 in nM | percent |\n| Predict mean | 4.29 ± 0.40 | 1.47 ± 0.07 | 6.40 ± 0.09 |\n| Circular FPs + linear layer | 1.71 ± 0.13 | 1.13 ± 0.03 | 2.63 ± 0.09 |\n| Circular FPs + neural net | 1.40 ± 0.13 | 1.36 ± 0.10 | 2.00 ± 0.09 |\n| Neural FPs + linear layer | 0.77 ± 0.11 | 1.15 ± 0.02 | 2.58 ± 0.18 |\n| Neural FPs + neural net | 0.52 ± 0.07 | 1.16 ± 0.03 | 1.43 ± 0.09 |\n\nTable 1: Mean predictive accuracy of neural fingerprints compared to standard circular fingerprints.\n\nDatasets. We compared the performance of standard circular fingerprints against neural graph fingerprints on a variety of domains:\n\n• Solubility: The aqueous solubility of 1144 molecules as measured by [4].\n• Drug efficacy: The half-maximal effective concentration (EC50) in vitro of 10,000 molecules against a sulfide-resistant strain of P. 
falciparum, the parasite that causes malaria, as measured by [5].\n• Organic photovoltaic efficiency: The Harvard Clean Energy Project [8] uses expensive DFT simulations to estimate the photovoltaic efficiency of organic molecules. We used a subset of 20,000 molecules from this dataset.\n\nPredictive accuracy. We compared the performance of circular fingerprints and neural graph fingerprints under two conditions: in the first condition, predictions were made by a linear layer using the fingerprints as input; in the second condition, predictions were made by a one-hidden-layer neural network using the fingerprints as input. In all settings, all differentiable parameters in the composed models were optimized simultaneously. Results are summarized in Table 1. In all experiments, the neural graph fingerprints matched or beat the accuracy of circular fingerprints, and the methods with a neural network on top of the fingerprints typically outperformed the linear layers.\n\nSoftware. Automatic differentiation (AD) software packages such as Theano [1] significantly speed up development time by providing gradients automatically, but can only handle limited control structures and indexing. Since we required relatively complex control flow and indexing in order to implement variants of Algorithm 2, we used a more flexible automatic differentiation package for Python called Autograd (github.com/HIPS/autograd). 
This package handles standard Numpy [18] code, and can differentiate code containing while loops, branches, and indexing.\n\nCode for computing neural fingerprints and producing visualizations is available at github.com/HIPS/neural-fingerprint.\n\n5 Limitations\n\nComputational cost. Neural fingerprints have the same asymptotic complexity in the number of atoms and the depth of the network as circular fingerprints, but have additional terms due to the matrix multiplies necessary to transform the feature vector at each step. To be precise, computing the neural fingerprint of depth R and fingerprint length L of a molecule with N atoms, using a molecular convolutional net having F features at each layer, costs O(RNFL + RNF^2). In practice, training neural networks on top of circular fingerprints usually took several minutes, while training both the fingerprints and the network on top took on the order of an hour on the larger datasets.\n\nLimited computation at each layer. How complicated should we make the function that goes from one layer of the network to the next? In this paper we chose the simplest feasible architecture: a single layer of a neural network. However, it may be fruitful to apply multiple layers of nonlinearities between each message-passing step (as in [22]), or to make information preservation easier by adapting the Long Short-Term Memory [10] architecture to pass information upwards.\n\nLimited information propagation across the graph. The local message-passing architecture developed in this paper scales well in the size of the graph (due to the low degree of organic molecules), but its ability to propagate information across the graph is limited by the depth of the network. This may be appropriate for small graphs such as those representing the small organic molecules used in this paper. 
However, in the worst case, it can take a network of depth N/2 to distinguish between graphs of size N. To avoid this problem, [2] proposed a hierarchical clustering of graph substructures. A tree-structured network could examine the structure of the entire graph using only log(N) layers, but would require learning to parse molecules. Techniques from natural language processing [25] might be fruitfully adapted to this domain.\n\nInability to distinguish stereoisomers. Special bookkeeping is required to distinguish between stereoisomers, including enantiomers (mirror images of molecules) and cis/trans isomers (rotation around double bonds). Most circular fingerprint implementations have the option to make these distinctions. Neural fingerprints could be extended to be sensitive to stereoisomers, but this remains a task for future work.\n\n6 Related work\n\nThis work is similar in spirit to the neural Turing machine [7], in the sense that we take an existing discrete computational architecture, and make each part differentiable in order to do gradient-based optimization.\n\nNeural nets for quantitative structure-activity relationship (QSAR). The modern standard for predicting properties of novel molecules is to compose circular fingerprints with fully-connected neural networks or other regression methods. [3] used circular fingerprints as inputs to an ensemble of neural networks, Gaussian processes, and random forests. [19] used circular fingerprints (of depth 2) as inputs to a multitask neural network, showing that multiple tasks helped performance.\n\nNeural graph fingerprints. The most closely related work is [15], who build a neural network having graph-valued inputs. Their approach is to remove all cycles and build the graph into a tree structure, choosing one atom to be the root. A recursive neural network [23, 24] is then run from the leaves to the root to produce a fixed-size representation. 
Because a graph having N nodes has N possible roots, all N possible graphs are constructed. The final descriptor is a sum of the representations computed by all distinct graphs; there are as many distinct graphs as there are atoms in the molecule. The computational cost of this method thus grows as O(F^2 N^2), where F is the size of the feature vector and N is the number of atoms, making it less suitable for large molecules.\n\nConvolutional neural networks. Convolutional neural networks have been used to model images, speech, and time series [14]. However, standard convolutional architectures use a fixed computational graph, making them difficult to apply to objects of varying size or structure, such as molecules. More recently, [12] and others have developed a convolutional neural network architecture for modeling sentences of varying length.\n\nNeural networks on fixed graphs. [2] introduce convolutional networks on graphs in the regime where the graph structure is fixed, and each training example differs only in having different features at the vertices of the same graph. In contrast, our networks address the situation where each training input is a different graph.\n\nNeural networks on input-dependent graphs. [22] propose a neural network model for graphs having an interesting training procedure. The forward pass consists of running a message-passing scheme to equilibrium, a fact which allows the reverse-mode gradient to be computed without storing the entire forward computation. They apply their network to predicting mutagenesis of molecular compounds as well as web page rankings. [16] also propose a neural network model for graphs, with a learning scheme whose inner loop optimizes not the training loss, but rather the correlation between each newly-proposed feature vector and the training error residual. They apply their model to a dataset of boiling points of 150 molecular compounds. 
Our paper builds on these ideas, with the following differences: our method replaces their complex training algorithms with simple gradient-based optimization, generalizes existing circular fingerprint computations, and applies these networks in the context of modern QSAR pipelines which use neural networks on top of the fingerprints to increase model capacity.\n\nUnrolled inference algorithms. [9] and others have noted that iterative inference procedures sometimes resemble the feedforward computation of a recurrent neural network. One natural extension of these ideas is to parameterize each inference step, and train a neural network to approximately match the output of exact inference using only a small number of iterations. The neural fingerprint, when viewed in this light, resembles an unrolled message-passing algorithm on the original graph.\n\n7 Conclusion\n\nWe generalized existing hand-crafted molecular features to allow their optimization for diverse tasks. By making each operation in the feature pipeline differentiable, we can use standard neural-network training methods to scalably optimize the parameters of these neural molecular fingerprints end-to-end. We demonstrated the interpretability and predictive performance of these new fingerprints. Data-driven features have already replaced hand-crafted features in speech recognition, machine vision, and natural-language processing. Carrying out the same task for virtual screening, drug design, and materials design is a natural next step.\n\nAcknowledgments\n\nWe thank Edward Pyzer-Knapp, Jennifer Wei, and the Samsung Advanced Institute of Technology for their support. This work was partially funded by NSF IIS-1421780.\n\nReferences\n\n[1] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. 
Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.\n[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.\n[3] George E. Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231, 2014.\n[4] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, 2004.\n[5] Francisco-Javier Gamo, Laura M. Sanz, Jaume Vidal, Cristina de Cozar, Emilio Alvarez, Jose-Luis Lavandera, Dana E. Vanderwall, Darren V. S. Green, Vinod Kumar, Samiul Hasan, et al. Thousands of chemical starting points for antimalarial lead identification. Nature, 465(7296):305–310, 2010.\n[6] Robert C. Glem, Andreas Bender, Catrin H. Arnby, Lars Carlsson, Scott Boyer, and James Smith. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs: the investigational drugs journal, 9(3):199–204, 2006.\n[7] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.\n[8] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S. Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M. Brockway, and Alán Aspuru-Guzik. The Harvard clean energy project: large-scale computational screening and design of organic photovoltaics on the world community grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, 2011.\n[9] John R. Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.\n[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. 
Neural Computation, 9(8):1735–1780, 1997.\n[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n[12] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, June 2014.\n[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n[14] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361, 1995.\n[15] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of Chemical Information and Modeling, 53(7):1563–1575, 2013.\n[16] Alessio Micheli. Neural network for graphs: A contextual constructive approach. Neural Networks, IEEE Transactions on, 20(3):498–511, 2009.\n[17] H. L. Morgan. The generation of a unique machine description for chemical structure. Journal of Chemical Documentation, 5(2):107–113, 1965.\n[18] Travis E. Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3):10–20, 2007.\n[19] Bharath Ramsundar, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015.\n[20] RDKit: Open-source cheminformatics. www.rdkit.org. [accessed 11-April-2013].\n[21] David Rogers and Mathew Hahn. Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5):742–754, 2010.\n[22] F. Scarselli, M. Gori, Ah Chung Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. 
Neural Networks, IEEE Transactions on, 20(1):61–80, January 2009.\n[23] Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, and Andrew Y. Ng. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pages 801–809, 2011.\n[24] Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics, 2011.\n[25] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.\n[26] Tox21 Challenge. National Center for Advancing Translational Sciences. http://tripod.nih.gov/tox21/challenge, 2014. [Online; accessed 2-June-2015].\n[27] Thomas Unterthiner, Andreas Mayr, Günter Klambauer, and Sepp Hochreiter. Toxicity prediction using deep learning. arXiv preprint arXiv:1503.01445, 2015.\n[28] Thomas Unterthiner, Andreas Mayr, Günter Klambauer, Marvin Steijaert, Jörg Wenger, Hugo Ceulemans, and Sepp Hochreiter. Deep learning as an opportunity in virtual screening. In Advances in Neural Information Processing Systems, 2014.\n[29] Li Wan, Matthew Zeiler, Sixin Zhang, Yann L. Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, 2013.\n[30] David Weininger. SMILES, a chemical language and information system. 
Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.", "award": [], "sourceid": 1321, "authors": [{"given_name": "David", "family_name": "Duvenaud", "institution": "Harvard University"}, {"given_name": "Dougal", "family_name": "Maclaurin", "institution": "Harvard University"}, {"given_name": "Jorge", "family_name": "Iparraguirre", "institution": "Harvard University"}, {"given_name": "Rafael", "family_name": "Bombarell", "institution": "Harvard University"}, {"given_name": "Timothy", "family_name": "Hirzel", "institution": "Harvard University"}, {"given_name": "Alan", "family_name": "Aspuru-Guzik", "institution": "Harvard University"}, {"given_name": "Ryan", "family_name": "Adams", "institution": "Harvard"}]}