{"title": "N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules", "book": "Advances in Neural Information Processing Systems", "page_first": 8466, "page_last": 8478, "abstract": "Machine learning techniques have recently been adopted in various applications in medicine, biology, chemistry, and material engineering. An important task is to predict the properties of molecules, which serves as the main subroutine in many downstream applications such as virtual screening and drug design. Despite the increasing interest, the key challenge is to construct proper representations of molecules for learning algorithms. This paper introduces the N-gram graph, a simple unsupervised representation for molecules. The method first embeds the vertices in the molecule graph. It then constructs a compact representation for the graph by assembling the vertex embeddings in short walks in the graph, which we show is equivalent to a simple graph neural network that needs no training. The representations can thus be efficiently computed and then used with supervised learning methods for prediction. Experiments on 60 tasks from 10 benchmark datasets demonstrate its advantages over both popular graph neural networks and traditional representation methods. This is complemented by theoretical analysis showing its strong representation and prediction power.", "full_text": "N-Gram Graph: Simple Unsupervised Representation\n\nfor Graphs, with Applications to Molecules\n\nShengchao Liu, Mehmet Furkan Demirel, Yingyu Liang\n\nDepartment of Computer Sciences, University of Wisconsin-Madison, Madison, WI\n\n{shengchao, demirel, yliang}@cs.wisc.edu\n\nAbstract\n\nMachine learning techniques have recently been adopted in various applications\nin medicine, biology, chemistry, and material engineering. 
An important task is\nto predict the properties of molecules, which serves as the main subroutine in\nmany downstream applications such as virtual screening and drug design. Despite\nthe increasing interest, the key challenge is to construct proper representations\nof molecules for learning algorithms. This paper introduces N-gram graph, a\nsimple unsupervised representation for molecules. The method \ufb01rst embeds the\nvertices in the molecule graph. It then constructs a compact representation for the\ngraph by assembling the vertex embeddings in short walks in the graph, which we\nshow is equivalent to a simple graph neural network that needs no training. The\nrepresentations can thus be ef\ufb01ciently computed and then used with supervised\nlearning methods for prediction. Experiments on 60 tasks from 10 benchmark\ndatasets demonstrate its advantages over both popular graph neural networks and\ntraditional representation methods. This is complemented by theoretical analysis\nshowing its strong representation and prediction power.\n\n1\n\nIntroduction\n\nIncreasingly, sophisticated machine learning methods have been used in non-traditional application\ndomains like medicine, biology, chemistry, and material engineering [14, 11, 16, 9]. This paper\nfocuses on a prototypical task of predicting properties of molecules. A motivating example is\nvirtual screening for drug discovery. Traditional physical screening for drug discovery (i.e., selecting\nmolecules based on properties tested via physical experiments) is typically accurate and valid, but\nalso very costly and slow. In contrast, virtual screening (i.e., selecting molecules based on predicted\nproperties via machine learning methods) can be done in minutes for predicting millions of molecules.\nTherefore, it can be a good \ufb01ltering step before the physical experiments, to help accelerate the drug\ndiscovery process and signi\ufb01cantly reduce resource requirements. 
The benefits gained then depend on the prediction performance of the learning algorithms.

A key challenge is that the raw data in these applications typically cannot be directly handled by existing learning algorithms, so suitable representations need to be constructed carefully. Unlike image or text data, where machine learning (in particular deep learning) has led to significant achievements, the most common raw inputs in molecule property prediction problems provide only highly abstract representations of chemicals (i.e., graphs on atoms with atom attributes).

To address the challenge, various representation methods have been proposed, mainly in two categories. The first category is chemical fingerprints, the most widely used feature representations in the aforementioned domains. The prototype is the Morgan fingerprints [42] (see Figure S1 for an example). The second category is graph neural networks (GNN) [25, 2, 33, 47, 26, 58]. They view molecules as graphs with attributes, and build a computational network tailored to the graph structure that constructs an embedding vector for the input molecule and feeds it into a predictor (classifier or regression model). The network is trained end-to-end on labeled data, learning the embedding and the predictor at the same time.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

These different representation methods have their own advantages and disadvantages. The fingerprints are simple and efficient to compute. They are also unsupervised, so the representation of each molecule can be computed once and then used by different machine learning methods for different tasks.
Graph neural networks in principle are more powerful: they can capture comprehensive information about molecules, including the skeleton structure, conformational information, and atom properties; they are trained end-to-end, potentially resulting in better representations for prediction. On the other hand, they need to be trained via supervised learning with sufficient labeled data, and for a new task the representation needs to be retrained. Their training is also highly non-trivial and can be computationally expensive. So a natural question comes up: can we combine the benefits by designing a simple and efficient unsupervised representation method with strong prediction performance?

To achieve this, this paper introduces an unsupervised representation method called N-gram graph. It views molecules as graphs and atoms as vertices with attributes. It first embeds the vertices by exploiting their special attribute structure. Then, it enumerates n-grams in the graph, where an n-gram refers to a walk of length n, and constructs the embedding for each n-gram by assembling the embeddings of its vertices. The final representation is constructed based on the embeddings of all its n-grams. We show that the graph embedding step can also be formulated as a simple graph neural network that has no parameters and thus requires no training. The approach is efficient, produces compact representations, and enjoys strong representation and prediction power, as shown by our theoretical analysis. Experiments on 60 tasks from 10 benchmark datasets show that it obtains overall better performance than both classic representation methods and several recent popular graph neural networks.

Related Work. We briefly describe the most related work here due to space limitations and include a more complete review in Appendix A.
Firstly, chemical fingerprints have long been used to represent molecules, including the classic Morgan fingerprints [42]. They have recently been used with deep learning models [38, 52, 37, 31, 27, 34]. Secondly, graph neural networks are recent deep learning models designed specifically for data with graph structure, such as social networks and knowledge graphs. See Appendix B for a brief introduction and refer to the surveys [30, 61, 57] for more details. Since molecules can be viewed as structured graphs, various graph neural networks have been proposed for them; popular ones include [2, 33, 47, 26, 58]. Finally, graph kernel methods can also be applied (e.g., [48, 49]). The implicit feature mapping induced by the kernel can be viewed as the representation for the input. The Weisfeiler-Lehman kernel [49] is particularly related due to its efficiency and theoretical backing. It is also similar in spirit to the Morgan fingerprints and closely related to the recent GIN graph neural network [58].

2 Preliminaries

Raw Molecule Data. This work views a molecule as a graph, where each atom is a vertex and each bond is an edge. Suppose there are m vertices in the graph, denoted as i ∈ {0, 1, ..., m − 1}. Each vertex has useful attribute information, like the atom symbol and the number of charges in the molecular graph. These vertex attributes are encoded into a vertex attribute matrix V of size m × S, where S is the number of attributes. An example of the attributes for vertex i is:

V_{i,·} = [V_{i,0}, V_{i,1}, . . . , V_{i,6}, V_{i,7}]

where V_{i,0} is the atom symbol, V_{i,1} counts the atom degree, and V_{i,6} and V_{i,7} indicate whether it is an acceptor or a donor. Details are listed in Appendix E. Note that the attributes typically have discrete values. The bonding information is encoded into the adjacency matrix A ∈ {0, 1}^{m×m}, where A_{i,j} = 1 if and only if vertices i and j are linked.
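As a concrete illustration of this encoding, here is a minimal sketch with made-up attribute values; the toy graph and attribute ids below are hypothetical, not taken from the paper's attribute list in Appendix E:

```python
import numpy as np

# Hypothetical 4-vertex "molecule" b - c - d - a with S = 2 discrete
# attributes per vertex (say, an atom-symbol id and the atom degree).
# The paper uses a richer list (including acceptor/donor flags); these
# values are made up for illustration only.
m, S = 4, 2
V = np.array([
    [1, 1],   # vertex 0: symbol id 1, degree 1
    [2, 2],   # vertex 1: symbol id 2, degree 2
    [3, 2],   # vertex 2: symbol id 3, degree 2
    [0, 1],   # vertex 3: symbol id 0, degree 1
])            # vertex attribute matrix, shape (m, S)

# Adjacency matrix A in {0,1}^{m x m}: A[i, j] = 1 iff i and j are bonded.
A = np.zeros((m, m), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

# The degree attribute must agree with the graph structure.
assert (V[:, 1] == A.sum(axis=1)).all()
```

The pair (V, A) is exactly the raw input described above; everything the method computes is derived from these two matrices.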
We let G = (V, A) denote a molecular graph. Sometimes there are additional types of information, like bonding types and pairwise atom distances in the 3D Euclidean space used by [33, 47, 26], which are beyond the scope of this work.

N-gram Approach. In natural language processing (NLP), an n-gram refers to a consecutive sequence of words. For example, the 2-grams of the sentence "the dataset is large" are {"the dataset", "dataset is", "is large"}. The N-gram approach constructs a representation vector c(n) for the sentence, whose coordinates correspond to all n-grams and the value of a coordinate is the number of times the corresponding n-gram shows up in the sentence. Therefore, the dimension of an n-gram vector is |V|^n for a vocabulary V, and the vector c(1) is just the count vector of the words in the sentence. The n-gram representation has been shown to be a strong baseline (e.g., [53]). One drawback is its high dimensionality, which can be alleviated by using word embeddings. Let W be a matrix whose i-th column is the embedding of the i-th word. Then f(1) = W c(1) is just the sum of the word vectors in the sentence, which is in lower dimension and has also been shown to be a strong baseline (e.g., [55, 6]). In general, an n-gram can be embedded as the element-wise product of the word vectors in it. Summing up all n-gram embeddings gives the embedding vector f(n). This has been shown both theoretically and empirically to preserve good information for downstream learning tasks even when using random word vectors (e.g., [3]).

3 N-gram Graph Representation

Our N-gram graph method consists of two steps: first embed the vertices, and then embed the graph based on the vertex embedding.

3.1 Vertex Embedding

Figure 1: The CBoW-like neural network g.
Each small box represents one attribute, and the gray color marks the bit set to one in the one-hot encoding. Each long box consists of S attributes with total length K. Part 1 is the summation of the embeddings of all the neighbors of vertex i, where W ∈ R^{r×K} is the vertex embedding matrix. Part 2 is a fully-connected neural network, and the final predictions are the attributes of vertex i.

The typical method to embed vertices in graphs is to view each vertex as one token and apply an analog of CBoW [41] or other word embedding methods (e.g., [28]). Here we propose our variant that utilizes the structure that each vertex has several attributes of discrete values. Recall that there are S attributes; see Section 2. Suppose the j-th attribute takes values in a set of size k_j, and let K = ∑_{j=0}^{S−1} k_j. Let h^j_i denote a one-hot vector encoding the j-th attribute of vertex i, and let h_i ∈ R^K be the concatenation h_i = [h^0_i; . . . ; h^{S−1}_i]. Given an embedding dimension r, we would like to learn matrices W^j ∈ R^{r×k_j} whose ℓ-th column is an embedding vector for the ℓ-th value of the j-th attribute. Once they are learned, we let W ∈ R^{r×K} be the concatenation W = [W^0, W^1, . . . , W^{S−1}], and define the representation for vertex i as

f_i = W h_i.    (1)

Now it is sufficient to learn the vertex embedding matrix W. We use a CBoW-like pipeline; see Algorithm 1. The intuition is to make sure the attributes h_i of a vertex i can be predicted from the h_j's in its neighborhood. Let C_i denote the set of vertices linked to i. We will train a neural network ĥ_i = g({h_j : j ∈ C_i}) so that its output matches h_i.
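To make the two embedding steps concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions rather than the paper's implementation: the vertex embedding matrix W is taken to be random instead of trained with Algorithm 1 (Section 4 argues that random embeddings already preserve the count information), and each n-gram (walk) is embedded as the element-wise product of its vertex embeddings, with f(n) summing over all walks of length n, computed by dynamic programming on the adjacency matrix rather than by enumeration; walks that revisit vertices are included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: m vertices, S discrete attributes with k_j values each.
m, S, r, T = 4, 2, 8, 3            # vertices, attributes, embedding dim, max n
k = [4, 3]                          # k_j = number of values of attribute j
K = sum(k)                          # total one-hot length
V = np.array([[1, 0], [2, 1], [3, 1], [0, 0]])   # attribute values per vertex
A = np.zeros((m, m))
for i, j in [(0, 1), (1, 2), (2, 3)]:            # linear graph 0 - 1 - 2 - 3
    A[i, j] = A[j, i] = 1.0

# Step 1: vertex embedding, equation (1): f_i = W h_i, with a random W
# standing in for the trained embedding matrix.
W = rng.standard_normal((r, K))

def one_hot(i):
    """Concatenated one-hot encoding h_i of vertex i's attributes."""
    h = np.zeros(K)
    offset = 0
    for j in range(S):
        h[offset + V[i, j]] = 1.0
        offset += k[j]
    return h

F = np.stack([W @ one_hot(i) for i in range(m)], axis=1)   # r x m, columns f_i

# Step 2: graph embedding. Column i of F_n holds the sum, over all walks of
# length n ending at vertex i, of the element-wise product of the walk's
# vertex embeddings, so walks are extended one edge at a time.
F_n = F.copy()
f_parts = [F_n.sum(axis=1)]          # f(1) = sum_i f_i
for n in range(2, T + 1):
    F_n = F * (F_n @ A)              # extend every walk by one edge
    f_parts.append(F_n.sum(axis=1))  # f(n)
f_G = np.concatenate(f_parts)        # final representation, dimension r * T

# Sanity check: f(2) equals the brute-force sum over all length-2 walks.
brute = sum(F[:, i] * F[:, j] for i in range(m) for j in range(m) if A[i, j])
assert np.allclose(f_G[r:2 * r], brute)
```

With a trained W from Algorithm 1 in place of the random one, f_G is the kind of vector that would then be fed to a downstream learner such as random forests or XGBoost.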
As specified in Figure 1, the network g first computes ∑_{j∈C_i} W h_j and then goes through a fully-connected network with parameters θ to get ĥ_i. Given a dataset S = {G_j = (V_j, A_j)}, the training is by minimizing the cross-entropy loss:

min_{W,θ} ∑_{G∈S} ∑_{i∈G} ∑_{0≤ℓ<S} cross-entropy(h^ℓ_i, ĥ^ℓ_i),  subject to  [ĥ^0_i; . . . ; ĥ^{S−1}_i] = g({h_j : j ∈ C_i}).

[...]

Each n-gram (i.e., walk) p is embedded as the element-wise product of the embeddings of its vertices, f_p = ∏_{i∈p} f_i; in particular, f(1) = ∑_{p: 1-gram} f_p = ∑_i f_i = W ∑_i h_i, and the analysis follows (slightly generalizing [3]). We summarize the results for f(1) and f(n) with n > 1 below and present the details in Appendix C.

Representation Power. Given a graph, let us define the bag-of-n-cooccurrences vector c(n) as follows. Recall that S is the number of attributes and K = ∑_{j=0}^{S−1} k_j, where k_j is the number of possible values for the j-th attribute; the value of the j-th attribute on the i-th vertex is denoted as V_{i,j}.

Definition 1 Given a walk p = (i_1, . . . , i_n) of length n, the vector e^{(j)}_p ∈ R^{\binom{k_j}{n}} is defined as the one-hot vector for the j-th attribute values {V_{i_1,j}, . . . , V_{i_n,j}} along the walk. The bag-of-n-cooccurrences vector c(n) is the concatenation of c^{(0)}(n), . . . , c^{(S−1)}(n), where c^{(j)}(n) = ∑_p e^{(j)}_p with the sum over all walks p of length n. Furthermore, let the count statistics c[T] be the concatenation of c(1), . . . , c(T).

So c^{(j)}(n) is the histogram of the different values of the j-th attribute along the walks, and c(n) is a concatenation over all the attributes. It is of high dimension ∑_{j=0}^{S−1} \binom{k_j}{n}. The following theorem then shows that f(n) is a compressed version of c(n) and preserves the information of the bag-of-n-cooccurrences.

Theorem 1 If r = Ω(n s_n^3 log K), where s_n is the sparsity of c(n), then there is a prior distribution over W so that f(n) = T(n) c(n) for a linear mapping T(n). If additionally c(n) is the sparsest vector satisfying f(n) = T(n) c(n), then with probability 1 − O(S exp(−(r/S)^{1/3})), c(n) can be efficiently recovered from f(n).

The sparsity assumption on c(n) can be relaxed to being close to the sparsest vector (e.g., dense but with only a few coordinates having large values), in which case c(n) can be approximately recovered. This assumption is justified by the fact that there is a large number of possible types of n-grams while only a fraction of them appear frequently in a graph. The prior distribution on W can be from a wide family of distributions; see the proof in Appendix C. This also helps explain why using random vertex embeddings in our method can lead to good prediction performance; see Section 5. In practice, W is learned and potentially captures better similarities among the vertices.

The theorem means that f_G preserves the information of the count statistics c(n) (1 ≤ n ≤ T). Note that typically no two graphs have exactly the same count statistics, so the graph G can be recovered from f_G. For example, consider a linear graph b − c − d − a, whose 2-grams are (b, c), (c, d), (d, a). From the 2-grams, it is easy to reconstruct the graph. In such cases, f_G can be used to recover G, i.e., f_G has the full representation power of G.

Prediction Power. Consider a prediction task and let ℓ_D(g) denote the risk of a prediction function g over the data distribution D.

Theorem 2 Let g_c be a prediction function on the count statistics c[T].
In the same setting as in Theorem 1, with probability 1 − O(T S exp(−(r/S)^{1/3})), there is a function g_f on the N-gram graph embeddings f_G with risk ℓ_D(g_f) = ℓ_D(g_c).

So there always exists a predictor on our embeddings with performance as good as any predictor on the count statistics. As mentioned, in typical cases the graph G can be recovered from the counts. Then there is always a predictor as good as the best predictor on the raw input G. Of course, one would like that not only does f_G have full information, but also that the information is easy to exploit. Below we provide the desired guarantee for the standard model of linear classifiers with ℓ2-regularization.

(Footnote 2: We do not present the version of our method excluding such walks due to its higher computational cost.)

Consider the binary classification task with the logistic loss function ℓ(g, y), where g is the prediction and y is the true label. Let ℓ_D(θ) = E_D[ℓ(g_θ, y)] denote the risk of a linear classifier g_θ with weight vector θ over the data distribution D. Let θ* denote the weight of the classifier over c[T] minimizing ℓ_D. Suppose we have a dataset {(G_i, y_i)}_{i=1}^{M} sampled i.i.d. from D, and θ̂ is the weight over f_G which is learned via ℓ2-regularization with regularization coefficient λ:

θ̂ = arg min_θ (1/M) ∑_{i=1}^{M} ℓ(⟨θ, f_{G_i}⟩, y_i) + λ ‖θ‖_2^2.    (4)

Theorem 3 Assume that f_G is scaled so that ‖f_G‖_2 ≤ 1 for any graph from D.
There exists a prior distribution over W such that, with r = Ω((n s_max^3 / ε^2) log K) for s_max = max{s_n : 1 ≤ n ≤ T} and an appropriate choice of the regularization coefficient, with probability 1 − δ − O(T S exp(−(r/S)^{1/3})), the θ̂ minimizing the ℓ2-regularized logistic loss over the N-gram graph embeddings f_{G_i} satisfies

ℓ_D(θ̂) ≤ ℓ_D(θ*) + O( ‖θ*‖_2 ( ε + √((1/M) log(1/δ)) ) ).    (5)

Therefore, the linear classifier over the N-gram embeddings learned via standard ℓ2-regularization has performance close to the best one on the count statistics. In practice, the label may depend nonlinearly on the count statistics or the embeddings, so one may prefer more sophisticated models. Empirically, we show that the information in our embeddings can indeed be efficiently exploited by classical methods like random forests and XGBoost.

5 Experiments

Here we evaluate the N-gram graph method on 60 molecule property prediction tasks, comparing with three types of representations: the Weisfeiler-Lehman kernel, Morgan fingerprints, and several recent graph neural networks. The results show that N-gram graph achieves better or comparable performance to the competitors.

Methods. Table 1 lists the feature representation and model combinations. Weisfeiler-Lehman (WL) Kernel [49], Support Vector Machine (SVM), Morgan Fingerprints, Random Forest (RF), and XGBoost (XGB) [15] are chosen since they are the prototypical representation and learning methods in these domains. Graph CNN (GCNN) [2], Weave Neural Network (Weave) [33], and Graph Isomorphism Network (GIN) [58] are end-to-end graph neural networks, which are recently proposed deep learning models for handling molecular graphs.

Table 1: Feature representation for each different machine learning model.
Both Morgan fingerprints and N-gram graph are used with Random Forest (RF) and XGBoost (XGB).

Feature Representation          Model
Weisfeiler-Lehman Graph Kernel  SVM
Morgan Fingerprints             RF, XGB
Graph Neural Network            GCNN, Weave, GIN
N-Gram Graph                    RF, XGB

Datasets. We test 6 regression and 4 classification datasets, each with multiple tasks. Since our focus is to compare the representations of the graphs, no transfer learning or multi-task learning is considered. In other words, we treat each task independently, which gives us 28 regression tasks and 32 classification tasks in total. See Table S5 for a detailed description of the attributes for the vertices in the molecular graphs from these datasets. All datasets are split into five folds, with cross-validation results reported as follows. (The code is available at https://github.com/chao1224/n_gram_graph; the baseline implementations follow [21, 44].)

• Regression datasets: Delaney [18], Malaria [23], CEP [29], QM7 [8], QM8 [43], QM9 [46].
• Classification datasets: Tox21 [51], ClinTox [24, 7], MUV [45], HIV [1].

Evaluation Metrics. The same evaluation metrics are used as in [56]. Note that, as illustrated in Appendix D, the labels are highly skewed for each classification task, and thus ROC-AUC or PR-AUC is used to measure the prediction performance instead of accuracy.

Hyperparameters. We tune the hyperparameters carefully for all representation and modeling methods. More details about the hyperparameters are provided in Appendix F. The following subsections display results with the N-gram parameter T = 6 and the embedding dimension r = 100.

Table 2: Performance overview: (# of tasks with top-1 performance, # of tasks with top-3 performance) is listed for each model and each dataset. Cases with no top-3 performance on a dataset are left blank.
Models that are not well tuned or are too slow on a dataset are marked with "–".

Dataset

# Task

Eval Metric

WL
SVM

Morgan

RF

Morgan
XGB

GCNN Weave

GIN

N-Gram

RF
0, 1
0, 1
0, 1
0, 1
0, 2
0, 8
3, 12

–
–
–
–
–
–

0, 7

0, 7

2, 4
0, 1
5, 31

Delaney
Malaria

CEP
QM7
QM8
QM9
Tox21
clintox
MUV
HIV

Overall

1
1
1
1
12
12
12
2
17
1
60

RMSE
RMSE
RMSE
MAE
MAE
MAE

ROC-AUC
ROC-AUC
PR-AUC
ROC-AUC

1, 1
1, 1

1, 4

0, 7

5, 11
1, 1
9, 25

–
0, 2
0, 1
4, 12

4, 15

1, 1

0, 1
2, 6
1, 8
0, 1
0, 1

7, 12
4, 7
0, 2
1, 2

12, 23

4, 18

0, 1
0, 1

5, 11

5, 13

N-Gram

XGB
0, 1
0, 1
0, 1
1, 1
2, 11
7, 12
9, 12
1, 2
1, 6
0, 1
21, 48

(a) ROC-AUC of the best models on Tox21 (Morgan+RF, GCNN, N-gram+XGB). Larger is better.

(b) MAE of the best models on QM9 (GCNN, Weave, N-gram+XGB). Smaller is better.

Figure 2: Performance of the best models on the datasets Tox21 and QM9, averaged over 5-fold cross-validation.

Performance. Table 2 summarizes the prediction performance of the methods on all 60 tasks. Since (1) no method can consistently beat all other methods on all tasks, and (2) for datasets like QM8, the errors (MAE) of the best models are all close to 0, we report both the top-1 and top-3 number of tasks each method obtained. Such a high-level overview can help better understand the model performance. Complete results are included in Appendix H.

Overall, we observe that N-gram graph, especially when used with XGBoost, shows better performance than the other methods. N-gram with XGBoost is in the top 1 for 21 out of 60 tasks, and in the top 3 for 48. On some tasks, the margin is not large but the advantage is consistent; see for example the tasks on the dataset Tox21 in Figure 2(a).
On some tasks, the advantage is significant; see for example the tasks u0, u298, h298, g298 on the dataset QM9 in Figure 2(b).

We also observe that random forest on Morgan fingerprints performs beyond general expectation, in particular better than the recent graph neural network models on the classification tasks. One possible explanation is that we have used up to 4000 trees and obtained improved performance compared to the 75 trees used in [56], since the number of trees is the most important parameter, as pointed out in [37]. It also suggests that Morgan fingerprints indeed contain a sufficient amount of information for the classification tasks, and methods like random forest are good at exploiting it.

Transferable Vertex Embedding. An intriguing property of the vertex embeddings is that they can be transferred across datasets. We evaluate N-gram graph with XGB on Tox21, using different vertex embeddings: trained on Tox21, random, or trained on other datasets. See details in Appendix G.1. Table 3 shows that embeddings from other datasets can be used to get comparable results. Even random embeddings can get good results, which is explained in Section 4.

Table 3: AUC-ROC of N-Gram graph with XGB on 12 tasks from Tox21.
Six vertex embeddings are considered: non-transfer (trained on Tox21 itself), randomly generated, and learned from four other datasets.

                 Non-Transfer  Random  Delaney  CEP    MUV    Clintox
NR-AR            0.791         0.790   0.785    0.787  0.796  0.780
NR-AR-LBD        0.864         0.846   0.863    0.849  0.864  0.867
NR-AhR           0.902         0.895   0.903    0.892  0.901  0.903
NR-Aromatase     0.869         0.858   0.867    0.848  0.858  0.866
NR-ER            0.753         0.751   0.752    0.740  0.735  0.747
NR-ER-LBD        0.838         0.820   0.843    0.820  0.827  0.847
NR-PPAR-gamma    0.851         0.809   0.862    0.813  0.832  0.857
SR-ARE           0.835         0.823   0.841    0.814  0.835  0.842
SR-ATAD5         0.860         0.830   0.844    0.817  0.845  0.857
SR-HSE           0.812         0.777   0.806    0.768  0.805  0.810
SR-MMP           0.918         0.909   0.918    0.902  0.916  0.919
SR-p53           0.868         0.856   0.869    0.841  0.856  0.870

Computational Cost. Table 4 reports the representation construction time for the different methods. Since vertex embeddings can be amortized across different tasks on the same dataset or even transferred, the main runtime of our method comes from the graph embedding step. It is relatively efficient, much faster than the GNNs and the kernel method, though computing Morgan fingerprints can be even faster.

Table 4: Representation construction time in seconds. One task from each dataset is used as an example.
Averaged over 5 folds, including both the training set and the test set.

Task      Dataset   WL (CPU)  Morgan FPs (CPU)  GCNN (GPU)  Weave (GPU)  GIN (GPU)  Vertex Emb. (GPU)  Graph Emb. (GPU)
Delaney   Delaney   2.46      0.25              39.70       65.82        –          49.63              2.90
Malaria   Malaria   128.81    5.28              377.24      536.99       –          1152.80            19.58
CEP       CEP       1113.35   17.69             607.23      849.37       –          2695.57            37.40
QM7       QM7       60.24     0.98              103.12      76.48        –          173.50             10.60
E1-CC2    QM8       584.98    3.60              382.72      262.16       –          966.49             33.43
mu        QM9       –         19.58             9051.37     1504.77      –          8279.03            169.72
NR-AR     Tox21     70.35     2.03              130.15      142.59       608.57     525.24             10.81
CT-TOX    Clintox   4.92      0.63              62.61       95.50        135.68     191.93             3.83
MUV-466   MUV       276.42    6.31              401.02      690.15       1327.26    1221.25            25.50
HIV       HIV       2284.74   17.16             1142.77     2138.10      3641.52    3975.76            139.85

Comparison to models using 3D information. What makes molecular graphs more complicated is that they can contain 3D information, which is helpful for making predictions [26]. Deep Tensor Neural Networks (DTNN) [47] and Message-Passing Neural Networks (MPNN) [26] are two graph neural networks that are able to utilize the 3D information encoded in the datasets.4 Therefore, we further compare our method to these two most advanced GNN models on the two datasets QM8 and QM9 that have 3D information. The results are summarized in Table 5. The detailed results are in Table S17 and the computational times are in Table S18. They show that our method, though not using 3D information, still gets comparable performance.

Table 5: Comparison to models using 3D information, on the two regression datasets QM8 and QM9, evaluated by MAE.
N-Gram does not use any spatial information, such as the distance between each atom pair, yet its performance is very comparable to that of the state-of-the-art methods.

GCNN Weave

DTNN MPNN

N-Gram

N-Gram

Morgan

Dataset

# Task

4, 10
0, 4
4, 14

0, 3
0, 1
0, 4

0, 5
7, 10
7, 15

5, 6
1, 9
6, 15

RF
0, 2
0, 5
0, 7

XGB
2, 5
4, 7
6, 12

QM8
QM9
Overall

12
12
24

–

WL
SVM

RF
1, 4

Morgan
XGB
0, 1

1, 4

0, 1

(Footnote 4: Weave [33] also uses a distance matrix, but it is the distance on the graph, i.e., the length of the shortest path between each atom pair, not the 3D Euclidean distance.)

Effect of r and T. We also explore the effect of the two key hyperparameters of N-gram graph: the vertex embedding dimension r and the N-gram length T. Figure 3 shows the results on 12 classification tasks from the Tox21 dataset, and Figure S2 shows the results on 3 regression tasks from the datasets Delaney, Malaria, and CEP. They reveal that, generally, r does not affect the model performance, while increasing T can bring significant improvement. More detailed discussions are in Appendix K.

Figure 3: Effects of the vertex embedding dimension r and the N-gram length T on 12 tasks from Tox21. x-axis: the hyperparameter T; y-axis: ROC-AUC. Different lines correspond to different methods and different values of r.

6 Conclusion

This paper introduced N-gram graph, a novel unsupervised representation method for molecules. It is simple and efficient, yet gives compact representations that can be applied with different learning methods.
Experiments show that it achieves overall better performance than prototypical traditional methods and several recent graph neural networks.

The method was inspired by recent word embedding methods and the traditional N-gram approach in natural language processing, and can be formulated as a simple graph neural network. It can also be used to handle general graph-structured data, such as social networks. Concrete future work includes applications to other types of graph-structured data, pre-training and fine-tuning vertex embeddings, and designing even more powerful variants of the N-gram graph neural network.

Acknowledgements

This work was supported in part by FA9550-18-1-0166. The authors would also like to acknowledge computing resources from the University of Wisconsin-Madison Center for High Throughput Computing and support provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

References

[1] Aids antiviral screen data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data. Accessed: 2017-09-27.

[2] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS Central Science, 3(4):283–293, 2017.

[3] Sanjeev Arora, Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli.
A compressed sensing view of unsupervised text embeddings, bag-of-n-grams, and LSTM. In International Conference on Learning Representations, 2018.

[4] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399, 2016.

[5] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.

[6] Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations, 2016.

[7] Artem V Artemov, Evgeny Putin, Quentin Vanhaelen, Alexander Aliper, Ivan V Ozerov, and Alex Zhavoronkov. Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes. bioRxiv, page 095653, 2016.

[8] Lorenz C Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 131(25):8732–8733, 2009.

[9] Keith T Butler, Daniel W Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. Machine learning for molecular and materials science. Nature, 559(7715):547, 2018.

[10] Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Technical Report, 2009.

[11] Diogo M Camacho, Katherine M Collins, Rani K Powers, James C Costello, and James J Collins. Next-generation machine learning for biological networks. Cell, 2018.

[12] Emmanuel J Candes. The restricted isometry property and its implications for compressed sensing.
Comptes Rendus Mathematique, 346(9-10):589–592, 2008.

[13] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[14] Hongming Chen, Ola Engkvist, Yinhai Wang, Marcus Olivecrona, and Thomas Blaschke. The rise of deep learning in drug discovery. Drug Discovery Today, 2018.

[15] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

[16] Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M Hoffman, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.

[17] George Dahl. Deep learning how I did it: Merck 1st place interview. Online article available from http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview, 2012.

[18] John S. Delaney. ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3):1000–1005, May 2004.

[19] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alan Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[20] Felix A Faber, Luke Hutchison, Bing Huang, Justin Gilmer, Samuel S Schoenholz, George E Dahl, Oriol Vinyals, Steven Kearnes, Patrick F Riley, and O Anatole von Lilienfeld. Prediction errors of molecular machine learning models lower than hybrid DFT error. Journal of Chemical Theory and Computation, 13(11):5255–5264, 2017.

[21] Matthias Fey and Jan E. Lenssen.
Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.

[22] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Bull. Am. Math, 54:151–165, 2017.

[23] Francisco-Javier Gamo, Laura M. Sanz, Jaume Vidal, Cristina de Cozar, Emilio Alvarez, Jose-Luis Lavandera, Dana E. Vanderwall, Darren V. S. Green, Vinod Kumar, Samiul Hasan, James R. Brown, Catherine E. Peishoff, Lon R. Cardon, and Jose F. Garcia-Bustos. Thousands of chemical starting points for antimalarial lead identification. Nature, 465(7296):305–310, May 2010.

[24] Kaitlyn M Gayvert, Neel S Madhukar, and Olivier Elemento. A data-driven approach to predicting successes and failures of clinical trials. Cell Chemical Biology, 23(10):1294–1301, 2016.

[25] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

[26] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1263–1272. JMLR.org, 2017.

[27] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2016.

[28] Aditya Grover and Jure Leskovec.
node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[29] Johannes Hachmann, Roberto Olivares-Amaya, Sule Atahan-Evrenk, Carlos Amador-Bedolla, Roel S. Sánchez-Carrera, Aryeh Gold-Parker, Leslie Vogt, Anna M. Brockway, and Alán Aspuru-Guzik. The Harvard Clean Energy Project: Large-scale computational screening and design of organic photovoltaics on the World Community Grid. The Journal of Physical Chemistry Letters, 2(17):2241–2251, September 2011.

[30] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.

[31] Stanisław Jastrzębski, Damian Leśniak, and Wojciech Marian Czarnecki. Learning to SMILE(S). arXiv preprint arXiv:1602.06289, 2016.

[32] Shiva Prasad Kasiviswanathan and Mark Rudelson. Restricted isometry property under high correlations. arXiv preprint arXiv:1904.05510, 2019.

[33] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design, 30(8):595–608, 2016.

[34] Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. arXiv preprint arXiv:1703.01925, 2017.

[35] Greg Landrum. RDKit: Open-source cheminformatics software, 2016.

[36] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

[37] Shengchao Liu, Moayad Alnammi, Spencer S Ericksen, Andrew F Voter, James L Keck, F Michael Hoffmann, Scott A Wildman, and Anthony Gitter. Practical model selection for prospective virtual screening.
bioRxiv, page 337956, 2018.

[38] Junshui Ma, Robert P Sheridan, Andy Liaw, George E Dahl, and Vladimir Svetnik. Deep neural nets as a method for quantitative structure-activity relationships. Journal of Chemical Information and Modeling, 55(2):263–274, 2015.

[39] Matthew K. Matlock, Na Le Dang, and S. Joshua Swamidass. Learning a local-variable model of aromatic and conjugated systems. ACS Central Science, 4(1):52–62, January 2018.

[40] Merck. Merck molecular activity challenge. https://www.kaggle.com/c/MerckActivity, 2012.

[41] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[42] HL Morgan. The generation of a unique machine description for chemical structures-a technique developed at Chemical Abstracts Service. Journal of Chemical Documentation, 5(2):107–113, 1965.

[43] Raghunathan Ramakrishnan, Mia Hartmann, Enrico Tapavicza, and O Anatole Von Lilienfeld. Electronic spectra from TDDFT and machine learning in chemical space. The Journal of Chemical Physics, 143(8):084111, 2015.

[44] Bharath Ramsundar, Peter Eastman, Patrick Walters, Vijay Pande, Karl Leswing, and Zhenqin Wu. Deep Learning for the Life Sciences. O'Reilly Media, 2019. https://www.amazon.com/Deep-Learning-Life-Sciences-Microscopy/dp/1492039837.

[45] Sebastian G Rohrer and Knut Baumann. Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. Journal of Chemical Information and Modeling, 49(2):169–184, 2009.

[46] Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17.
Journal of Chemical Information and Modeling, 52(11):2864–2875, 2012.

[47] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks. Nature Communications, 8:13890, 2017.

[48] John Shawe-Taylor, Nello Cristianini, et al. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[49] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

[50] Roberto Todeschini and Viviana Consonni. Molecular Descriptors for Chemoinformatics: Volume I: Alphabetical Listing / Volume II: Appendices, References, volume 41. John Wiley & Sons, 2009.

[51] Tox21 Data Challenge. Tox21 data challenge 2014. https://tripod.nih.gov/tox21/challenge/, 2014.

[52] Thomas Unterthiner, Andreas Mayr, Günter Klambauer, Marvin Steijaert, Jörg K Wegner, Hugo Ceulemans, and Sepp Hochreiter. Deep learning as an opportunity in virtual screening. Advances in Neural Information Processing Systems, 27, 2014.

[53] Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 90–94. Association for Computational Linguistics, 2012.

[54] David Weininger, Arthur Weininger, and Joseph L Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. Journal of Chemical Information and Computer Sciences, 29(2):97–101, 1989.

[55] John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. Towards universal paraphrastic sentence embeddings.
arXiv preprint arXiv:1511.08198, 2015.

[56] Zhenqin Wu, Bharath Ramsundar, Evan N Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2):513–530, 2018.

[57] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.

[58] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

[59] Zi Yin and Yuanyuan Shen. On the dimensionality of word embedding. In Advances in Neural Information Processing Systems, pages 895–906, 2018.

[60] Rex Ying, Jiaxuan You, Christopher Morris, Xiang Ren, William L Hamilton, and Jure Leskovec. Hierarchical graph representation learning with differentiable pooling. arXiv preprint arXiv:1806.08804, 2018.

[61] Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.