{"title": "Learning Invariant Representations of Molecules for Atomization Energy Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 440, "page_last": 448, "abstract": "The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to the holy grail of ''chemical accuracy''.", "full_text": "Learning Invariant Representations of Molecules for\n\nAtomization Energy Prediction\n\nGr\u00e9goire Montavon1\u2217, Katja Hansen2, Siamac Fazli1, Matthias Rupp3, Franziska Biegler1,\n\nAndreas Ziehe1, Alexandre Tkatchenko2, O. Anatole von Lilienfeld4, Klaus-Robert M\u00fcller1,5\u2020\n\n1. Machine Learning Group, TU Berlin\n\n2. Fritz-Haber-Institut der Max-Planck-Gesellschaft, Berlin\n\n3. Institute of Pharmaceutical Sciences, ETH Zurich\n\n4. Argonne Leadership Computing Facility, Argonne National Laboratory, Lemont, IL\n\n5. Dept. of Brain and Cognitive Engineering, Korea University\n\nAbstract\n\nThe accurate prediction of molecular energetics in chemical compound space is\na crucial ingredient for rational compound design. The inherently graph-like,\nnon-vectorial nature of molecular data gives rise to a unique and dif\ufb01cult ma-\nchine learning problem. In this paper, we adopt a learning-from-scratch approach\nwhere quantum-mechanical molecular energies are predicted directly from the raw\nmolecular geometry. 
The study suggests a bene\ufb01t from setting \ufb02exible priors and\nenforcing invariance stochastically rather than structurally. Our results improve\nthe state-of-the-art by a factor of almost three, bringing statistical methods one\nstep closer to chemical accuracy.\n\n1\n\nIntroduction\n\nThe accurate prediction of molecular energetics in chemical compound space (CCS) is a crucial\ningredient for compound design efforts in chemical and pharmaceutical industries. One of the ma-\njor challenges consists of making quantitative estimates in CCS at moderate computational cost\n(milliseconds per compound or faster). Currently only high level quantum-chemistry calculations,\nwhich can take days per molecule depending on property and system, yield the desired \u201cchemical\naccuracy\u201d of 1 kcal/mol required for computational molecular design.\n\nThis problem has only recently captured the interest of the machine learning community (Baldi\net al., 2011). The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique\nand dif\ufb01cult machine learning problem. A central question is how to represent molecules in a way\nthat makes prediction of molecular properties feasible and accurate (Von Lilienfeld and Tuckerman,\n2006). This question has already been extensively discussed in the cheminformatics literature, and\nmany so-called molecular descriptors exist (Todeschini and Consonni, 2009). Unfortunately, they\noften require a substantial amount of domain knowledge and engineering. Furthermore, they are not\nnecessarily transferable across the whole chemical compound space.\n\nIn this paper, we pursue a more direct approach initiated by Rupp et al. (2012) to the problem.\nWe learn the mapping between the molecule and its atomization energy from scratch1 using the\n\u201cCoulomb matrix\u201d as a low-level molecular descriptor (Rupp et al., 2012). 
As we will see later, an

∗Electronic address: gregoire.montavon@tu-berlin.de
†Electronic address: klaus-robert.mueller@tu-berlin.de
1This approach has already been applied in multiple domains such as natural language processing (Collobert et al., 2011) and speech recognition (Jaitly and Hinton, 2011).

Figure 1: Different representations of the same molecule: (a) raw molecule with Cartesian coordinates and associated charges, (b) original (non-sorted) Coulomb matrix as computed by Equation 1, (c) eigenspectrum of the Coulomb matrix, (d) sorted Coulomb matrix, (e) set of randomly sorted Coulomb matrices.

inherent problem of the Coulomb matrix descriptor is that it lacks invariance with respect to permutation of atom indices, thus leading to an exponential blow-up of the problem's dimensionality. We center the discussion around the following two questions: How can permutation invariance be injected optimally into the machine learning model? Which model characteristics lead to the highest prediction accuracy?

Our study extends the work of Rupp et al. (2012) by empirically comparing several methods for enforcing permutation invariance: (1) computing the sorted eigenspectrum of the Coulomb matrix, (2) sorting the rows and columns by their respective norms, and (3), a new idea, randomly sorting rows and columns in order to associate a set of randomly sorted Coulomb matrices with each molecule, thus extending the dataset considerably.
These three representations are then compared in light of several models, such as Gaussian kernel ridge regression and multilayer neural networks, in which the Gaussian prior is traded against more flexibility and the ability to learn the representation directly from the data.

Related Work

In atomic-scale physics and in materials science, neural networks have been used to model the potential energy surface of single systems (e.g., the dynamics of a single molecule over time) since the early 1990s (Lorenz et al., 2004; Manzhos and Carrington, 2006; Behler, 2011). Recently, Gaussian processes were used for this as well (Bartók et al., 2010). The major difference from the problem presented here is that previous work in modeling quantum mechanical energies looked mostly at the dynamics of one molecule, whereas we use data from different molecules simultaneously ("learning across chemical compound space"). Attempts in this direction have been rare (Balabin and Lomakina, 2009; Hautier et al., 2010; Balabin and Lomakina, 2011).

2 Representing Molecules

Electronic structure methods based on quantum-mechanical first principles only require a set of nuclear charges Zi and the corresponding Cartesian coordinates Ri of the atomic positions in 3D space as input for the calculation of molecular energetics. Here we use exactly the same information as input for our machine learning algorithms.
Specifically, for each molecule, we construct the so-called Coulomb matrix C, which contains information about Zi and Ri in a way that preserves many of the required properties of a good descriptor (Rupp et al., 2012):

Cij = 0.5 Zi^2.4              for i = j
Cij = Zi Zj / |Ri − Rj|       for i ≠ j        (1)

The diagonal elements of the Coulomb matrix correspond to a polynomial fit of the potential energies of the free atoms, while the off-diagonal elements encode the Coulomb repulsion between all possible pairs of nuclei in the molecule. As such, the Coulomb matrix is invariant to translations and rotations of the molecule in 3D space; both transformations must keep the potential energy of the molecule constant by definition.

Two problems with the Coulomb matrix representation prevent it from being used out-of-the-box in a vector-space model: (1) the dimension of the Coulomb matrix depends on the number of atoms in the molecule, and (2) the ordering of atoms in the Coulomb matrix is undefined, that is, many Coulomb matrices can be associated with the same molecule by just permuting rows and columns.

The first problem can be mitigated by introducing "invisible atoms" that have nuclear charge zero and do not interact with other atoms. These invisible atoms do not influence the physics of the molecule of interest and make the total number of atoms in the molecule sum to a constant d. In practice, this corresponds to padding the Coulomb matrix with zero-valued entries so that it has size d × d, as has been done by Rupp et al.
(2012).

Solving the second problem is more difficult and has no obvious physically plausible workaround. Three candidate representations are depicted in Figure 1 and presented below.

2.1 Eigenspectrum Representation

The eigenspectrum representation (Rupp et al., 2012) is obtained by solving the eigenvalue problem Cv = λv under the constraint λi ≥ λi+1, where λi > 0. The spectrum (λ1, . . . , λd) is used as the representation. It is easy to see that this representation is invariant to permutation of atoms in the Coulomb matrix.

On the other hand, the dimensionality d of the eigenspectrum is low compared to the initial 3d − 6 degrees of freedom of most molecules. While this sharp dimensionality reduction may yield some useful built-in regularization, it may also introduce unrecoverable noise.

2.2 Sorted Coulomb Matrices

Another solution to the ordering problem is to choose the permutation of atoms whose associated Coulomb matrix C satisfies ||Ci|| ≥ ||Ci+1|| ∀i, where Ci denotes the ith row of the Coulomb matrix. Unlike the eigenspectrum representation, two different molecules necessarily have different associated sorted Coulomb matrices.

2.3 Random(-ly sorted) Coulomb Matrices

A way to deal with the larger dimensionality that results from taking the whole Coulomb matrix instead of the eigenspectrum is to extend the dataset with Coulomb matrices that are randomly sorted. This is achieved by associating a conditional distribution over Coulomb matrices p(C|M) with each molecule M. Let C(M) define the set of matrices that are valid Coulomb matrices of the molecule M.
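As an illustration, the Coulomb matrix of Equation 1 and the two fixed-length representations of Sections 2.1 and 2.2 can be sketched as follows (a minimal NumPy sketch; the function names and the padding size d are illustrative, not the paper's code):

```python
import numpy as np

def coulomb_matrix(Z, R, d=23):
    """Padded Coulomb matrix (Equation 1) for nuclear charges Z and positions R.

    Z: shape (n,), R: shape (n, 3). The matrix is padded with zero
    rows/columns ("invisible atoms") up to size d x d.
    """
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    C = np.zeros((d, d))
    for i in range(n):
        for j in range(n):
            if i == j:
                C[i, j] = 0.5 * Z[i] ** 2.4              # polynomial fit of free-atom energy
            else:
                C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])  # Coulomb repulsion
    return C

def eigenspectrum(C):
    """Eigenvalues of C in decreasing order (Section 2.1); permutation invariant."""
    return np.sort(np.linalg.eigvalsh(C))[::-1]

def sorted_coulomb(C):
    """Permute rows and columns so that row norms decrease (Section 2.2)."""
    order = np.argsort(-np.linalg.norm(C, axis=1))
    return C[order][:, order]
```

Since the same permutation is applied to rows and columns, the sorted matrix remains a valid Coulomb matrix of the molecule, and the eigenspectrum is unchanged under any such relabeling of atoms.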
The unnormalized probability distribution from which we would like to sample Coulomb matrices is defined as:

p*(C|M) = Σ_n 1[C ∈ C(M)] · 1[ ||Ci|| + ni ≥ ||Ci+1|| + ni+1 ∀i ] · p_N(0,σI)(n)        (2)

The first term constrains the sample to be a valid Coulomb matrix of M, the second term ensures the sorting constraint, and the third term defines the randomness parameterized by the noise level σ. Sampling from this distribution can be achieved approximately using the following algorithm:

Algorithm for generating a random Coulomb matrix

1. Take any Coulomb matrix C among the set of matrices that are valid Coulomb matrices of M and compute its row norms ||C|| = (||C1||, . . . , ||Cd||).

2. Draw n ∼ N(0, σI) and find the permutation P that sorts ||C|| + n, that is, find the permutation that satisfies permute_P(||C|| + n) = sort(||C|| + n).

3. Permute C row-wise and then column-wise with the same permutation, that is, C_random = permutecols_P(permuterows_P(C)).

The idea of dataset extension has already been used in the context of handwritten character recognition by, among others, LeCun et al. (1998) and Ciresan et al. (2010), and in the context of support vector machines by DeCoste and Schölkopf (2002). Random Coulomb matrices can be used at training time in order to multiply the number of data points, but also at prediction time: predicting the property of a molecule consists of predicting the property for Coulomb matrices drawn from the distribution associated with M and outputting the average of all these predictions, y = E_{C|M}[f(C)].

Input (Coulomb matrix) / Output (atomization energy)

Figure 2: Two-dimensional PCA of the data with increasingly strong label contribution (from left to right). Molecules with low atomization energies are depicted in red and molecules with high atomization energies are depicted in blue. The plots suggest an interesting mix of global and local statistics with highly non-Gaussian distributions.

3 Predicting Atomization Energies

The atomization energy E quantifies the potential energy stored in all chemical bonds. As such, it is defined as the difference between the potential energy of a molecule and the sum of potential energies of its composing isolated atoms. The potential energy of a molecule is the solution to the electronic Schrödinger equation HΦ = EΦ, where H is the Hamiltonian of the molecule and Φ is the state of the system. Note that the Hamiltonian is uniquely defined by the Coulomb matrix up to rotation and translation symmetries. A dataset {(M1, E1), . . . , (Mn, En)} is created by running a Schrödinger equation solver on a small set of molecules. Figure 2 shows a two-dimensional PCA visualization of the dataset where input and output distributions exhibit an interesting mix of local and global statistics.

Obtaining atomization energies from the Schrödinger equation solver is computationally expensive and, as a consequence, only a fraction of the molecules in the chemical compound space can be labeled. The learning algorithm is then asked to generalize from these few data points to unseen molecules. In this section, we show how the two algorithms of this study, kernel ridge regression and the multilayer neural network, are applied to this problem. These algorithms are well-established nonlinear methods and are good candidates for handling the intrinsic nonlinearities of the problem.
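The random-sorting procedure of Section 2.3 and the prediction-time averaging y = E_{C|M}[f(C)] can be sketched as follows (a NumPy sketch under the paper's description; helper names are illustrative):

```python
import numpy as np

def random_coulomb(C, sigma=1.0, rng=np.random):
    """One realization of a randomly sorted Coulomb matrix (Section 2.3).

    Gaussian noise is added to the row norms, and the permutation that
    sorts the noisy norms in decreasing order is applied to both rows
    and columns, so the result stays a valid Coulomb matrix.
    """
    norms = np.linalg.norm(C, axis=1)
    noise = rng.normal(0.0, sigma, size=norms.shape)
    P = np.argsort(-(norms + noise))        # permutation sorting ||C|| + n
    return C[P][:, P]

def predict_average(f, C, n_samples=10, sigma=1.0, rng=np.random):
    """Prediction-time averaging y = E_{C|M}[f(C)] over random realizations."""
    return float(np.mean([f(random_coulomb(C, sigma, rng)) for _ in range(n_samples)]))
```

With sigma = 0 the noise vanishes and the procedure reduces to the deterministic sorted Coulomb matrix of Section 2.2.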
In kernel ridge regression, the measure of similarity is encoded in the kernel. On the other hand, in multilayer neural networks, the measure of similarity is learned essentially from data and implicitly given by the mapping onto increasingly many layers. In general, neural networks are more flexible and make fewer assumptions about the data. However, this comes at the cost of being more difficult to train and regularize.

3.1 Kernel Ridge Regression

The most basic algorithm to solve the nonlinear regression problem at hand is kernel ridge regression (cf. Hastie et al., 2001). It uses a quadratic constraint on the norm of αi. As is well known, the solution of the minimization problem

min_α  Σi ( Eest(xi) − Eref_i )² + λ Σi αi²

reads α = (K + λI)^{-1} Eref, where K is the empirical kernel and the input data xi is either the eigenspectrum of the Coulomb matrix or the vectorized sorted Coulomb matrix.

Expanding the dataset with the randomly generated Coulomb matrices described in Section 2.3 yields a huge dataset that is difficult to handle with standard kernel ridge regression algorithms. Although approximations of the kernel can improve its scalability, random Coulomb matrices can be handled more easily by encoding permutations directly into the kernel.
We redefine the kernel as a sum over permutations:

K̃(xi, xj) = (1/2) Σ_{l=1..L} [ K(xi, Pl(xj)) + K(Pl(xi), xj) ]        (3)

where Pl is the l-th permutation of atoms corresponding to the l-th realization of the random Coulomb matrix and L is the total number of permutations. This sum over multiple permutations has the effect of testing multiple plausible alignments of molecules.

Figure 3: Data flow from the raw molecule to the predicted atomization energy E. The molecule (a) is converted to its randomly sorted Coulomb matrix representation (b). The Coulomb matrix is then converted into a suitable sensory input (c) that is fed to the neural network (d). The output of the neural network is then rescaled to the original energy unit (e).
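A minimal sketch of the permutation kernel of Equation 3 together with the closed-form ridge solution α = (K + λI)^{-1} Eref (assuming NumPy; the exact pairing of random realizations used in the experiments may differ):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix between rows of X1 (n1, D) and X2 (n2, D)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def perm_kernel(X, XP, sigma, op="sum"):
    """Permutation kernel of Equation 3.

    X: (n, D) one reference realization per molecule;
    XP: (n, L, D) the L random-permutation realizations (flattened matrices).
    "sum" accumulates over alignments; "max" keeps only the best alignment.
    """
    n, L, D = XP.shape
    K = np.zeros((n, n))
    for l in range(L):
        A = gaussian_kernel(X, XP[:, l, :], sigma)   # K(x_i, P_l(x_j))
        Kl = 0.5 * (A + A.T)                         # symmetrized term of Equation 3
        K = K + Kl if op == "sum" else np.maximum(K, Kl)
    return K

def krr_fit(K, E, lam):
    """Kernel ridge regression coefficients: alpha = (K + lam*I)^{-1} E."""
    return np.linalg.solve(K + lam * np.eye(len(E)), E)
```

The symmetrization 0.5 (A + Aᵀ) mirrors the two terms of Equation 3 and keeps the resulting kernel matrix symmetric even though K(xi, Pl(xj)) alone is not symmetric in i and j.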
Note that the summation can be replaced by a "max" operator in order to focus on correct alignments of molecules and ignore poor alignments.

3.2 Multilayer Neural Networks

A main feature of multilayer neural networks is their ability to learn internal representations that potentially make models statistically and computationally more efficient. Unfortunately, the intrinsically non-convex nature of neural networks makes them hard to optimize and regularize in a principled manner. Often, a crucial factor for training neural networks successfully is to start with a favorable initial conditioning of the learning problem, that is, a good sensory input representation and a proper weight initialization.

Unlike for images or speech data, a substantial amount of label-relevant information is contained within the elements of the Coulomb matrix themselves and not only in their dependencies. For this reason, taking the real quantities directly as input is likely to lead to a poorly conditioned optimization problem. Instead, we choose to break apart each dimension of the Coulomb matrix C by converting the representation into a three-dimensional tensor of essentially binary predicates as follows:

x = [ …, tanh((C − θ)/θ), tanh(C/θ), tanh((C + θ)/θ), … ]        (4)

The new representation x is fed as input to the neural network. Note that in the new representation, many elements are constant and can be pruned. In practice, by choosing an appropriate step θ, the dimensionality of the sensory input is kept to tractable levels.

This binarization of the input space improves the conditioning of the learning problem and makes the model more flexible.
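The tanh expansion of Equation 4 can be sketched as follows (a NumPy sketch; the number of offsets and the layout of the predicates are illustrative, since the paper only specifies the step θ):

```python
import numpy as np

def binarize(C, theta=1.0, n_steps=3):
    """Soft-binary expansion of Equation 4.

    Each Coulomb-matrix entry c is spread over a set of saturating
    predicates tanh((c + k*theta)/theta) for offsets k = -n_steps..n_steps,
    so that the magnitude of c is encoded by how many predicates saturate
    rather than by one raw real value.
    """
    c = np.asarray(C, dtype=float).ravel()
    return np.concatenate([np.tanh((c + k * theta) / theta)
                           for k in range(-n_steps, n_steps + 1)])
```

Each scalar thus becomes 2·n_steps + 1 near-binary features, which is the "break apart each dimension" step described above; constant features can be pruned afterwards.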
As we will see in Section 5, learning from this flexible representation requires enough data to compensate for the lack of a strong prior and might lead to low performance if this condition is not met. The full data flow from the raw molecule to the predicted atomization energy is depicted in Figure 3.

4 Methodology

Dataset  As in Rupp et al. (2012), we select a subset of 7165 small molecules extracted from a huge database of nearly one billion small molecules collected by Blum and Reymond (2009). These molecules are composed of at most 23 atoms, at most 7 of which are heavy atoms. Molecules are converted to a suitable Cartesian coordinate representation using the universal force field method (Rappé et al., 1992) as implemented in the software OpenBabel (Guha et al., 2006). The Coulomb matrices can then be computed from these Cartesian coordinates using Equation 1. Atomization energies are calculated for each molecule and range from −800 to −2000 kcal/mol. As a result, we have a dataset of 7165 Coulomb matrices of size 23 × 23 with their associated one-dimensional labels2. Random Coulomb matrices are generated with the noise parameter σ = 1 (see Equation 2).

Model validation  For each learning method we used stratified 5-fold cross validation with identical cross validation folds, where the stratification was done by grouping molecules into groups of five by their energies and then randomly assigning one molecule to each fold, as in Rupp et al. (2012). This sampling reduces the variance of the test error estimator. Each algorithm is optimized for mean squared error.
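The energy-stratified fold assignment described above can be sketched as follows (a NumPy sketch; the function name is illustrative):

```python
import numpy as np

def stratified_folds(energies, n_folds=5, rng=np.random):
    """Energy-stratified cross-validation folds (Section 4).

    Molecules are sorted by energy, grouped into consecutive groups of
    n_folds, and one molecule of each group is randomly assigned to each
    fold, which reduces the variance of the test error estimator.
    """
    order = np.argsort(energies)                 # molecules sorted by energy
    folds = [[] for _ in range(n_folds)]
    for start in range(0, len(order), n_folds):
        group = order[start:start + n_folds]     # one consecutive energy group
        for f, idx in zip(rng.permutation(len(group)), group):
            folds[f].append(int(idx))
    return folds
```

Every fold then contains exactly one molecule from each energy group, so all folds cover the full energy range.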
To illustrate how the prediction accuracy changes when increasing the train-\ning sample size, each model was trained on 500 to 7000 data points which were sampled identically\nfor the different methods.\n\nChoice of parameters for kernel ridge regression The kernel ridge regression model was trained\nusing a Gaussian kernel (Kij = exp[\u2212||xi \u2212 xj||2/(2\u03c32)]) where \u03c3 is the kernel width. No fur-\n\nther scaling or normalization of the data was done, as the meaningfulness of the data in chemi-\ncal compound space was to be preserved. A grid search with an inner cross validation was used\nto determine the hyperparameters for each of the \ufb01ve cross validation folds for each method,\nnamely kernel width \u03c3 and regularization strength \u03bb. Grid-searching for optimal hyperparame-\nters can be easily parallelized. The regularization parameter was varied from 10\u221211 to 101 on\na logarithmic scale and the kernel width was varied from 5 to 81 on a linear scale with a step\nsize of 4. For the eigenspectrum representation the individual folds showed lower regulariza-\ntion parameters (\u03bbeig = 2.15 \u00b7 10\u221210 \u00b1 0.00) as compared to the sorted Coulomb representation\n(\u03bbsorted = 1.67 \u00b7 10\u22127 \u00b1 0.00). The optimal kernel width parameters are \u03c3eig = 41 \u00b1 6.07 and\n\u03c3sorted = 77 \u00b1 0.00. As indicated by the standard deviation 0.00, identical parameters are of-\nten chosen for all folds of cross-validation. Training one fold, for one particular set of param-\neters took approximately 10 seconds. When the algorithm is trained on random Coulomb ma-\ntrices, we set the number of permutations involved in the kernel to L = 250 (see Equation 3)\nand grid-search hyperparameters over both the \u201csum\u201d and \u201cmax\u201d kernels. 
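The grid search with inner cross validation used to select the kernel width and regularization strength can be sketched as follows (a simplified NumPy sketch with an interleaved inner split; the actual folds, grids, and permutation kernels in the experiments differ):

```python
import numpy as np

def grid_search_krr(X, E, lambdas, sigmas, n_inner=5):
    """Pick (sigma, lam) with lowest mean inner-validation MSE for
    Gaussian kernel ridge regression (Section 4)."""
    n = len(E)
    idx = np.arange(n)
    best, best_err = None, np.inf
    for sigma in sigmas:
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-d2 / (2.0 * sigma ** 2))     # full Gaussian kernel matrix
        for lam in lambdas:
            errs = []
            for f in range(n_inner):
                val = idx[f::n_inner]            # simple interleaved inner split
                tr = np.setdiff1d(idx, val)
                a = np.linalg.solve(K[np.ix_(tr, tr)] + lam * np.eye(len(tr)), E[tr])
                pred = K[np.ix_(val, tr)] @ a
                errs.append(float(np.mean((pred - E[val]) ** 2)))
            if np.mean(errs) < best_err:
                best_err, best = float(np.mean(errs)), (sigma, lam)
    return best
```

Since every (sigma, lam) cell is independent, the search parallelizes trivially, as noted above.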
Obtained parameters are λrandom = 0.0157 ± 0.0247 and σrandom = 74 ± 4.38.

Choice of parameters for the neural network  We choose a binarization step θ = 1 (see Equation 4). As a result, the neural network takes approximately 1800 inputs. We use two hidden layers composed of 400 and 100 units with sigmoidal activation functions, respectively. Initial weights W0 and learning rates γ are chosen as W0 ∼ N(0, 1/√m) and γ = γ0/√m, where m is the number of input units and γ0 is the global learning rate of the network, set to γ0 = 0.01. The error derivative is backpropagated from layer l to layer l − 1 by multiplying it by η = √(m/n), where m and n are the number of input and output units of layer l. These choices for W0, γ and η ensure that the representations at each layer fall into the correct regime of the nonlinearity and that weights in each layer evolve at the correct speed. Inputs and outputs are scaled to have mean 0 and standard deviation 1. We use averaged stochastic gradient descent (ASGD) with minibatches of size 25 for a maximum of 250000 iterations and with ASGD coefficients set so that the neural network remembers approximately 10% of its training history. The training is performed on 90% of the training set and the rest is used for early stopping. Training the neural network takes between one hour and one day on a CPU depending on the sample complexity. When using the random Coulomb matrix representation, the prediction for a new molecule is averaged over 10 different realizations of its associated random Coulomb matrix.

5 Results

Cross-validation results for each learning algorithm and representation are shown in Table 1.
For the sake of completeness, we also include some baseline results such as the mean predictor (simply predicting the mean of labels in the training set), linear regression, k-nearest neighbors, mixed effects models (Pinheiro and Bates, 2000; Fazli et al., 2011) and kernel support vector regression (Smola and Schölkopf, 2004).

2The dataset is available at http://www.quantum-machine.org.

Learning algorithm                  | Molecule representation | MAE           | RMSE
Mean predictor                      | None                    | 179.02 ± 0.08 | 223.92 ± 0.32
Linear regression                   | Eigenspectrum           | 70.72 ± 2.12  | 92.49 ± 2.70
Linear regression                   | Sorted Coulomb          | 71.54 ± 0.97  | 95.97 ± 1.45
K-nearest neighbors                 | Eigenspectrum           | 29.17 ± 0.35  | 38.01 ± 1.11
K-nearest neighbors                 | Sorted Coulomb          | 20.72 ± 0.32  | 27.22 ± 0.84
Mixed effects                       | Eigenspectrum           | 10.50 ± 0.48  | 20.38 ± 9.29
Mixed effects                       | Sorted Coulomb          | 8.5 ± 0.45    | 12.16 ± 0.95
Gaussian support vector regression  | Eigenspectrum           | 10.78 ± 0.58  | 19.47 ± 9.46
Gaussian support vector regression  | Sorted Coulomb          | 8.06 ± 0.38   | 12.59 ± 2.17
Gaussian kernel ridge regression    | Eigenspectrum           | 11.39 ± 0.81  | 16.01 ± 1.71
Gaussian kernel ridge regression    | Sorted Coulomb          | 8.72 ± 0.40   | 12.59 ± 1.35
Gaussian kernel ridge regression    | Random Coulomb          | 7.79 ± 0.42   | 11.40 ± 1.11
Multilayer neural network           | Eigenspectrum           | 14.08 ± 0.29  | 20.29 ± 0.73
Multilayer neural network           | Sorted Coulomb          | 11.82 ± 0.45  | 16.01 ± 0.81
Multilayer neural network           | Random Coulomb          | 3.51 ± 0.13   | 5.96 ± 0.48

Table 1: Prediction errors in terms of mean absolute error (MAE) and root mean square error (RMSE), in kcal/mol, for several algorithms and types of representations. Linear regression and k-nearest neighbors are inaccurate compared to the more refined kernel methods and multilayer neural network. The multilayer neural network performance varies considerably depending on the type of representation but sets the lowest error in our study on the random Coulomb representation.
Linear regression and k-nearest neighbors are clearly off-the-mark compared to the other, more sophisticated models such as mixed effects models, kernel methods and multilayer neural networks.

While results for kernel algorithms are similar, they all differ considerably from those obtained with the multilayer neural network. In particular, we can observe that the kernel methods perform reasonably well with all types of representation, while the multilayer neural network performance is highly dependent on the representation fed as input.

More specifically, the multilayer neural network tends to perform better as the input representation gets richer (as the total amount of information in the input distribution increases), suggesting that the lack of a strong inbuilt prior in the neural network must be compensated for by a large amount of data. The neural network performs best with random Coulomb matrices, which are intrinsically the richest representation, as a whole distribution over Coulomb matrices is associated with each molecule.

A similar phenomenon can be observed from the learning curves in Figure 4. As the training data increases, the error for Gaussian kernel ridge regression decreases slowly, while the neural network can take greater advantage of this additional data.

6 Conclusion

Predicting molecular energies quickly and accurately across the chemical compound space (CCS) is an important problem, as the quantum-mechanical calculations typically take days and do not scale well to more complex systems.
Supervised statistical learning is a natural candidate for solving this problem as it encourages computational units to focus on solving the problem of interest rather than solving the more general Schrödinger equation.

In this paper, we have further developed the learning-from-scratch approach initiated by Rupp et al. (2012) and provided a deeper understanding of some of the ingredients for learning a successful mapping between raw molecular geometries and atomization energies. Our results suggest the importance of having flexible priors (in our case, a multilayer network) and lots of data (generated artificially by exploiting symmetries of the Coulomb matrix). Our work improves the state-of-the-art on this dataset by a factor of almost three. From a reference MAE of 9.9 kcal/mol (Rupp et al., 2012), we went down to a MAE of 3.51 kcal/mol, which is considerably closer to the 1 kcal/mol required for chemical accuracy.

[Figure 4 plots: mean absolute error (kcal/mol) versus number of training samples for Gaussian kernel ridge regression and the multilayer neural network, with one learning curve per representation (eigenspectrum, sorted Coulomb, random Coulomb).]

Figure 4: Learning curves for Gaussian kernel ridge regression and the multilayer neural network. Results for kernel ridge regression are more invariant to the representation and to the number of samples than for the multilayer neural network. The gray area at the bottom of the plot indicates the level at which the prediction is considered to be "chemically accurate".

Many open problems remain that make quantum chemistry an attractive challenge for Machine Learning: (1) Are there fundamental modeling limits of the statistical learning approach for quantum chemistry applications, or is it rather a matter of producing more training data? (2) The training data can be considered noise free. Thus, are there better ML models for the noise free case while regularizing away the intrinsic problem complexity to keep the ML model small? (3) Can better representations be devised with inbuilt invariance properties (e.g. Tangent Distance, Simard et al., 1996), harvesting physical prior knowledge? (4) How can we extract physics insights on quantum mechanics from the trained nonlinear ML prediction models?

Acknowledgments

This work is supported by the World Class University Program through the National Research Foundation of Korea funded by the Ministry of Education, Science, and Technology, under Grant R31-10008, and the FP7 program of the European Community (Marie Curie IEF 273039). This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-06CH11357. This research is supported, in part, by the Natural Sciences and Engineering Research Council of Canada. The authors also thank Márton Danóczy for preliminary work and useful discussions.

References

Roman M. Balabin and Ekaterina I. Lomakina. Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies. Journal of Chemical Physics, 131(7):074104, 2009.

Roman M. Balabin and Ekaterina I. Lomakina.
Support vector machine regression (LS-SVM)—an alternative to artificial neural networks (ANNs) for the analysis of quantum chemistry data? Physical Chemistry Chemical Physics, 13(24):11710–11718, 2011.

Pierre Baldi, Klaus-Robert Müller, and Gisbert Schneider. Editorial: Charting chemical space: Challenges and opportunities for artificial intelligence and machine learning. Molecular Informatics, 30(9):751–751, 2011.

Albert P. Bartók, Mike C. Payne, Risi Kondor, and Gábor Csányi. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett., 104(13):136403, 2010.

Jörg Behler. Neural network potential-energy surfaces in chemistry: a tool for large-scale simulations. Physical Chemistry Chemical Physics, 13(40):17930–17955, 2011.

Lorenz C. Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 131(25):8732–8733, 2009.

Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–3220, 2010.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch.
Journal of Machine Learning Research, 12:2493–2537, 2011.

Dennis DeCoste and Bernhard Schölkopf. Training invariant support vector machines. Machine Learning, 46(1–3):161–190, 2002.

Siamac Fazli, Márton Danóczy, Jürg Schelldorfer, and Klaus-Robert Müller. ℓ1-penalized linear mixed-effects models for high dimensional data with application to BCI. NeuroImage, 56(4):2100–2108, 2011.

Rajarshi Guha, Michael T. Howard, Geoffrey R. Hutchison, Peter Murray-Rust, Henry Rzepa, Christoph Steinbeck, Jörg Wegner, and Egon L. Willighagen. The Blue Obelisk, interoperability in chemical informatics. Journal of Chemical Information and Modeling, 46(3):991–998, 2006.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., 2001.

Geoffroy Hautier, Christopher C. Fisher, Anubhav Jain, Tim Mueller, and Gerbrand Ceder. Finding nature's missing ternary oxide compounds using machine learning and density functional theory. Chemistry of Materials, 22(12):3762–3767, 2010.

Navdeep Jaitly and Geoffrey E. Hinton. Learning a better representation of speech soundwaves using restricted Boltzmann machines. In ICASSP, pages 5884–5887, 2011.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Sönke Lorenz, Axel Groß, and Matthias Scheffler. Representing high-dimensional potential-energy surfaces for reactions at surfaces by neural networks. Chemical Physics Letters, 395(4–6):210–215, 2004.

Sergei Manzhos and Tucker Carrington. A random-sampling high dimensional model representation neural network for building potential energy surfaces. J. Chem. Phys., 125:084109, 2006.

José C. Pinheiro and Douglas M. Bates. 
Mixed-Effects Models in S and S-PLUS. Springer, New York, 2000.

Anthony K. Rappé, Carla J. Casewit, K. S. Colwell, William A. Goddard, and W. M. Skiff. UFF, a full periodic table force field for molecular mechanics and molecular dynamics simulations. Journal of the American Chemical Society, 114(25):10024–10035, 1992.

Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O. Anatole von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett., 108(5):058301, 2012.

Patrice Simard, Yann LeCun, John S. Denker, and Bernard Victorri. Transformation invariance in pattern recognition: Tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pages 239–274, 1996.

Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.

Roberto Todeschini and Viviana Consonni. Handbook of Molecular Descriptors. Wiley-VCH, Weinheim, Germany, second edition, 2009.

O. Anatole von Lilienfeld and Mark E. Tuckerman. Molecular grand-canonical ensemble density functional theory and exploration of chemical space. 
The Journal of Chemical Physics, 125(15):154104, 2006.