{"title": "Wasserstein Weisfeiler-Lehman Graph Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 6439, "page_last": 6449, "abstract": "Most graph kernels are an instance of the class of R-Convolution kernels, which measure the similarity of objects by comparing their substructures.\nDespite their empirical success, most graph kernels use a naive aggregation of the final set of substructures, usually a sum or average, thereby potentially discarding valuable information about the distribution of individual components. Furthermore, only a limited instance of these approaches can be extended to continuously attributed graphs. \nWe propose a novel method that relies on the Wasserstein distance between the node feature vector distributions of two graphs, which allows to find subtler differences in data sets by considering graphs as high-dimensional objects, rather than simple means.\nWe further propose a Weisfeiler--Lehman inspired embedding scheme for graphs with continuous node attributes and weighted edges, enhance it with the computed Wasserstein distance, and thus improve the state-of-the-art prediction performance on several graph classification tasks.", "full_text": "Wasserstein Weisfeiler-Lehman Graph Kernels\n\nMatteo Togninalli1,2,\u21e4\n\nmatteo.togninalli@bsse.ethz.ch\n\nElisabetta Ghisu1,2,\u21e4\n\nelisabetta.ghisu@bsse.ethz.ch\n\nFelipe Llinares-L\u00f3pez1,2\n\nfelipe.llinares@bsse.ethz.ch\n\nBastian Rieck1,2\n\nbastian.rieck@bsse.ethz.ch\n\nKarsten Borgwardt1,2\n\nkarsten.borgwardt@bsse.ethz.ch\n\n1DEPARTMENT OF BIOSYSTEMS SCIENCE AND ENGINEERING, ETH ZURICH, SWITZERLAND\n\n2SIB SWISS INSTITUTE OF BIOINFORMATICS, SWITZERLAND\n\n\u21e4These authors contributed equally\n\nAbstract\n\nMost graph kernels are an instance of the class of R-Convolution kernels, which\nmeasure the similarity of objects by comparing their substructures. 
Despite their empirical success, most graph kernels use a naive aggregation of the final set of substructures, usually a sum or average, thereby potentially discarding valuable information about the distribution of individual components. Furthermore, only a limited number of these approaches can be extended to continuously attributed graphs. We propose a novel method that relies on the Wasserstein distance between the node feature vector distributions of two graphs, which allows finding subtler differences in data sets by considering graphs as high-dimensional objects rather than simple means. We further propose a Weisfeiler–Lehman-inspired embedding scheme for graphs with continuous node attributes and weighted edges, enhance it with the computed Wasserstein distance, and thereby improve the state-of-the-art prediction performance on several graph classification tasks.

1 Introduction

Graph-structured data have become ubiquitous across domains over the last decades, with examples ranging from social and sensor networks to chemo- and bioinformatics. Graph kernels [45] have been highly successful in dealing with the complexity of graphs and have shown good predictive performance on a variety of classification problems [27, 38, 47]. Most graph kernels rely on the R-Convolution framework [18], which decomposes structured objects into substructures to compute local similarities that are then aggregated. Although successful in several applications, R-Convolution kernels on graphs have limitations: (1) the simplicity of the way in which the similarities between substructures are aggregated might limit their ability to capture complex characteristics of the graph; (2) most proposed variants do not generalise to graphs with high-dimensional continuous node attributes, and extensions are far from straightforward. Various solutions have been proposed to address point (1). For example, Fröhlich et al.
[15] introduced kernels based on the optimal\nassignment of node labels for molecular graphs, although these kernels are not positive de\ufb01nite [43].\nRecently, another approach was proposed by Kriege et al. [25], which employs a Weisfeiler\u2013Lehman\nbased colour re\ufb01nement scheme and uses an optimal assignment of the nodes to compute the kernel.\nHowever, this method cannot handle continuous node attributes, leaving point (2) as an open problem.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fTo overcome both limitations, we propose a method that combines the most successful vectorial graph\nrepresentations derived from the graph kernel literature with ideas from optimal transport theory,\nwhich have recently gained considerable attention. In particular, improvements of the computational\nstrategies to ef\ufb01ciently obtain Wasserstein distances [1, 8] have led to many applications in machine\nlearning that use it for various purposes, ranging from generative models [2] to new loss functions [14].\nIn graph applications, notions from optimal transport were used to tackle the graph alignment problem\n[46]. In this paper, we provide the theoretical foundations of our method, de\ufb01ne a new graph kernel\nformulation, and present successful experimental results. 
Specifically, our main contributions can be summarised as follows:

• We present the graph Wasserstein distance, a new distance between graphs based on their node feature representations, and we discuss how kernels can be derived from it;
• We introduce a Weisfeiler–Lehman-inspired embedding scheme that works for both categorically labelled and continuously attributed graphs, and we couple it with our graph Wasserstein distance;
• We outperform the state of the art for graph kernels on traditional graph classification benchmarks with continuous attributes.

2 Background: graph kernels and Wasserstein distance

In this section, we introduce the notation that will be used throughout the manuscript. Moreover, we provide the necessary background on graph kernel methods and the Wasserstein distance.

2.1 Graph kernels

Kernels are a class of similarity functions that present attractive properties to be used in learning algorithms [36]. Let $\mathcal{X}$ be a set and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a function associated with a Hilbert space $\mathcal{H}$, such that there exists a map $\phi : \mathcal{X} \to \mathcal{H}$ with $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$. Then, $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) and $k$ is said to be a positive definite kernel. A positive definite kernel can be interpreted as a dot product in a high-dimensional space, thereby permitting its use in any learning algorithm that relies on dot products, such as support vector machines (SVMs), by virtue of the kernel trick [35]. Because ensuring positive definiteness is not always feasible, many learning algorithms were recently proposed to extend SVMs to indefinite kernels [3, 26, 29, 30].
We define a graph as a tuple $G = (V, E)$, where $V$ and $E$ denote the set of nodes and edges, respectively; we further assume that the edges are undirected. Moreover, we denote the cardinality of nodes and edges for $G$ as $|V| = n_G$ and $|E| = m_G$.
For a node $v \in V$, we write $N(v) = \{u \in V \mid (v, u) \in E\}$ and $|N(v)| = \deg(v)$ to denote its first-order neighbourhood. We say that a graph is labelled if its nodes have categorical labels. A label on the nodes is a function $\ell : V \to \Sigma$ that assigns to each node $v$ in $G$ a value $\ell(v)$ from a finite label alphabet $\Sigma$. Additionally, we say that a graph is attributed if for each node $v \in V$ there exists an associated vector $a(v) \in \mathbb{R}^m$. In this paper, $a(v)$ are the node attributes and $\ell(v)$ are the categorical node labels of node $v$. In particular, the node attributes are high-dimensional continuous vectors, whereas the categorical node labels are assumed to be integer numbers (encoding either an ordered discrete value or a category). With the term "node labels", we will implicitly refer to categorical node labels. Finally, a graph can have weighted edges, and the function $w : E \to \mathbb{R}$ defines the weight $w(e)$ of an edge $e := (v, u) \in E$.
Kernels on graphs are generally defined using the R-Convolution framework by [18]. The main idea is to decompose graph $G$ into substructures and to define a kernel value $k(G, G')$ as a combination of substructure similarities. A pioneering kernel on graphs was presented by [19], where node and edge attributes are exploited for label sequence generation using a random walk scheme. Subsequently, a more efficient approach based on shortest paths [5] was proposed, which computes each kernel value $k(G, G')$ as a sum of the similarities between each shortest path in $G$ and each shortest path in $G'$. Despite the practical success of R-Convolution kernels, they often rely on aggregation strategies that ignore valuable information, such as the distribution of the substructures. An example is the Weisfeiler–Lehman (WL) subtree kernel or one of its variants [33, 37, 38], which generates graph-level features by summing the contribution of the node representations.
To avoid these simplifications, we want to use concepts from optimal transport theory, such as the Wasserstein distance, which can help to better capture the similarities between graphs.

2.2 Wasserstein distance

The Wasserstein distance is a distance function between probability distributions defined on a given metric space. Let $\sigma$ and $\mu$ be two probability distributions on a metric space $M$ equipped with a ground distance $d$, such as the Euclidean distance.
Definition 1. The $L^p$-Wasserstein distance for $p \in [1, \infty)$ is defined as
$$W_p(\sigma, \mu) := \left( \inf_{\gamma \in \Gamma(\sigma, \mu)} \int_{M \times M} d(x, y)^p \, \mathrm{d}\gamma(x, y) \right)^{1/p}, \quad (1)$$
where $\Gamma(\sigma, \mu)$ is the set of all transportation plans $\gamma \in \Gamma(\sigma, \mu)$ over $M \times M$ with marginals $\sigma$ and $\mu$ on the first and second factors, respectively.
The Wasserstein distance satisfies the axioms of a metric, provided that $d$ is a metric (see the monograph of Villani [44], chapter 6, for a proof). Throughout the paper, we will focus on the distance for $p = 1$ and we will refer to the $L^1$-Wasserstein distance when mentioning the Wasserstein distance, unless noted otherwise.
The Wasserstein distance is linked to the optimal transport problem [44], where the aim is to find the most "inexpensive" way, in terms of the ground distance, to transport all the probability mass from distribution $\sigma$ to match distribution $\mu$. An intuitive illustration can be made for the 1-dimensional case, where the two probability distributions can be imagined as piles of dirt or sand. The Wasserstein distance, sometimes also referred to as the earth mover's distance [34], can be interpreted as the minimum effort required to move the content of the first pile to reproduce the second pile.
In this paper, we deal with finite sets of node embeddings and not with continuous probability distributions.
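To make the pile-of-sand intuition concrete, the following sketch computes the distance between two small point clouds with uniform mass by solving the transport problem as a generic linear program. This is an illustrative sketch, not the paper's released implementation (the paper uses a network simplex solver); SciPy's `linprog` is our assumed stand-in solver.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_1(X, Y):
    """L1-Wasserstein distance between uniform empirical distributions
    over the rows of X (n x m) and Y (n' x m)."""
    n, np_ = len(X), len(Y)
    # Ground-distance matrix M[i, j] = ||x_i - y_j||_2.
    M = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Equality constraints on the transport matrix P (flattened row-major):
    # each row of P sums to 1/n, each column to 1/n'.
    A_eq = []
    for i in range(n):                       # row-marginal constraints
        row = np.zeros((n, np_)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(np_):                     # column-marginal constraints
        col = np.zeros((n, np_)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(np_, 1.0 / np_)])
    res = linprog(M.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun                           # optimal <P, M>

X = np.array([[0.0], [1.0]])
Y = np.array([[0.5], [1.5]])
# Each point carries mass 1/2 and moves by 0.5, so W1 = 0.5.
print(wasserstein_1(X, Y))
```

The LP formulation matches the discrete reformulation introduced next; it is only practical for small point sets, which is all this illustration needs.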
Therefore, we can reformulate the Wasserstein distance as a sum rather than an integral, and use the matrix notation commonly encountered in the optimal transport literature [34] to represent the transportation plan. Given two sets of vectors $X \in \mathbb{R}^{n \times m}$ and $X' \in \mathbb{R}^{n' \times m}$, we can equivalently define the Wasserstein distance between them as
$$W_1(X, X') := \min_{P \in \Gamma(X, X')} \langle P, M \rangle. \quad (2)$$
Here, $M$ is the distance matrix containing the distances $d(x, x')$ between each element $x$ of $X$ and $x'$ of $X'$, $P \in \Gamma$ is a transport matrix (or joint probability), and $\langle \cdot, \cdot \rangle$ is the Frobenius dot product. The transport matrix $P$ contains the fractions that indicate how to transport the values from $X$ to $X'$ with the minimal total transport effort. Because we assume that the total mass to be transported equals 1 and is evenly distributed across the elements of $X$ and $X'$, the row and column values of $P$ must sum up to $1/n$ and $1/n'$, respectively.

3 Wasserstein distance on graphs

The unsatisfactory nature of the aggregation step of current R-Convolution graph kernels, which may mask important substructure differences by averaging, motivated us to develop a finer distance measure between structures and their components. In parallel, recent advances in optimisation solutions for faster computation of the optimal transport problem inspired us to consider this framework for the problem of graph classification. Our method relies on the following steps: (1) transform each graph into a set of node embeddings, (2) measure the Wasserstein distance between each pair of graphs, and (3) compute a similarity matrix to be used in the learning algorithm. Figure 1 illustrates the first two steps, and Algorithm 1 summarises the whole procedure. We start by defining an embedding scheme and illustrate how we integrate embeddings in the Wasserstein distance.
Definition 2 (Graph Embedding Scheme). Given a graph $G = (V, E)$, a graph embedding scheme $f : G \to \mathbb{R}^{|V| \times m}$, $f(G) = X_G$, is a function that outputs a fixed-size vectorial representation for each node in the graph. For each $v_i \in V$, the $i$-th row of $X_G$ is called the node embedding of $v_i$.
Note that Definition 2 permits treating node labels, which are categorical attributes, as one-dimensional attributes with $m = 1$.
Definition 3 (Graph Wasserstein Distance). Given two graphs $G = (V, E)$ and $G' = (V', E')$, a graph embedding scheme $f : G \to \mathbb{R}^{|V| \times m}$ and a ground distance $d : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$, we define the Graph Wasserstein Distance (GWD) as
$$D_W^f(G, G') := W_1(f(G), f(G')). \quad (3)$$

Figure 1: Visual summary of the graph Wasserstein distance. First, $f$ generates embeddings for two input graphs $G$ and $G'$. Then, the Wasserstein distance between the embedding distributions is computed.

We will now propose a graph embedding scheme inspired by the WL kernel on categorically labelled graphs, extend it to continuously attributed graphs with weighted edges, and show how to integrate it with the GWD presented in Definition 3.

3.1 Generating node embeddings

The Weisfeiler–Lehman scheme. The Weisfeiler–Lehman subtree kernel [37, 38], designed for labelled non-attributed graphs, looks at similarities among subtree patterns, defined by a propagation scheme on the graphs that iteratively compares labels on the nodes and their neighbours. This is achieved by creating a sequence of ordered strings through the aggregation of the labels of a node and its neighbours; those strings are subsequently hashed to create updated compressed node labels. With increasing iterations of the algorithm, these labels represent increasingly larger neighbourhoods of each node, allowing the comparison of more extended substructures.
Specifically, consider a graph $G = (V, E)$, let $\ell^0(v) = \ell(v)$ be the initial node label of $v$ for each $v \in V$, and let $H$ be the number of WL iterations.
Then, we can define a recursive scheme to compute $\ell^h(v)$ for $h = 1, \ldots, H$ by looking at the ordered set of neighbour labels $N^h(v) = \{\ell^h(u_0), \ldots, \ell^h(u_{\deg(v)-1})\}$ as
$$\ell^{h+1}(v) = \mathrm{hash}\left(\ell^h(v), N^h(v)\right). \quad (4)$$
We call this procedure the WL labelling scheme. As in the original publication [37], we use perfect hashing for the hash function, so nodes at iteration $h + 1$ will have the same label if and only if their label and those of their neighbours are identical at iteration $h$.
Extension to continuous attributes. For graphs with continuous attributes $a(v) \in \mathbb{R}^m$, we need to improve the WL refinement step, whose original definition prohibits handling the continuous case. The key idea is to create an explicit propagation scheme that leverages and updates the current node features by averaging over the neighbourhoods. Although similar approaches have been implicitly investigated for computing node-level kernel similarities [27, 28], they rely on additional hashing steps for the continuous features. Moreover, we can easily account for edge weights by considering them in the average calculation of each neighbourhood. Suppose we have a continuous attribute $a^0(v) = a(v)$ for each node $v \in G$. Then, we recursively define
$$a^{h+1}(v) = \frac{1}{2}\left(a^h(v) + \frac{1}{\deg(v)} \sum_{u \in N(v)} w\left((v, u)\right) \cdot a^h(u)\right). \quad (5)$$
When edge weights are not available, we set $w((v, u)) = 1$. We consider the weighted average of the neighbourhood attribute values instead of a sum and add the $1/2$ factor because we want to ensure a similar scale of the features across iterations; in fact, we concatenate such features for building our proposed kernel (see Definition 4 for more details) and observe better empirical results with similarly scaled features.
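The update in Eq. (5) can be sketched in a few lines. This is a minimal illustration, assuming a dense symmetric weight matrix `W` (with `W[u, v] = w((u, v))` and 0 where no edge exists) and no isolated nodes; a sparse-matrix version would follow the same pattern.

```python
import numpy as np

def propagate(A, W):
    """One step of Eq. (5): A is the (n x m) attribute matrix a^h,
    W the (n x n) symmetric edge-weight matrix; returns a^{h+1}."""
    deg = (W != 0).sum(axis=1)            # deg(v) = |N(v)|; assumes deg(v) > 0
    neigh_mean = (W @ A) / deg[:, None]   # weighted neighbourhood average
    return 0.5 * (A + neigh_mean)         # halve to keep feature scales stable

# Path graph 1 - 2 - 3 with unit edge weights and scalar attributes 0, 1, 2.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A = np.array([[0.], [1.], [2.]])
print(propagate(A, W))  # rows become 0.5, 1.0, 1.5
```

Iterating this function `H` times and concatenating the intermediate matrices yields the continuous WL features defined below.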
Although this is not a test of graph isomorphism, this refinement step can be seen as an intuitive extension for continuous attributes of the one used by the WL subtree kernel on categorical node labels, a widely successful baseline. Moreover, it resembles the propagation scheme used in many graph neural networks, which have proven to be successful for node classification on large data sets [9, 21, 22]. Finally, its ability to account for edge weights makes it applicable to all types of graphs without having to perform a hashing step [27]. Further extensions of the refinement step to account for high-dimensional edge attributes are left for future work. A straightforward example would be to also apply the scheme on the dual graph (where each edge is represented as a node, and connectivity is established if two edges in the primal graph share the same node) and then combine the obtained kernel with the kernel obtained on primal graphs via appropriate weighting.
Graph embedding scheme. Using the recursive procedure described above, we propose a WL-based graph embedding scheme that generates node embeddings from the node labels or attributes of the graphs. In the following, we use $m$ to denote the dimensionality of the node attributes ($m = 1$ for the categorical labels).
Definition 4 (WL features). Let $G = (V, E)$ and let $H$ be the number of WL iterations. Then, for every $h \in \{0, \ldots, H\}$, we define the WL features as
$$X^h_G = [x^h(v_1), \ldots, x^h(v_{n_G})]^T, \quad (6)$$
where $x^h(\cdot) = \ell^h(\cdot)$ for categorically labelled graphs and $x^h(\cdot) = a^h(\cdot)$ for continuously attributed graphs. We refer to $X^h_G \in \mathbb{R}^{n_G \times m}$ as the node features of graph $G$ at iteration $h$. Then, the node embeddings of graph $G$ at iteration $H$ are defined as
$$f^H : G \to \mathbb{R}^{n_G \times (m(H+1))}, \quad G \mapsto \mathrm{concatenate}(X^0_G, \ldots, X^H_G). \quad (7)$$
We observe that a graph can be both categorically labelled and continuously attributed, and one could extend the above scheme by jointly considering this information (for instance, by concatenating the node features). However, we will leave this scenario as an extension for future work; thereby, we avoid having to define an appropriate distance measure between categorical and continuous data, as this is a long-standing issue [40].

3.2 Computing the Wasserstein distance

Once the node embeddings are generated by the graph embedding scheme, we evaluate the pairwise Wasserstein distance between graphs. We start by computing the ground distances between each pair of nodes. For categorical node features, we use the normalised Hamming distance:
$$d_{\mathrm{Ham}}(v, v') = \frac{1}{H+1} \sum_{i=1}^{H+1} \rho(v_i, v'_i), \qquad \rho(x, y) = \begin{cases} 1, & x \neq y \\ 0, & x = y \end{cases} \quad (8)$$
The Hamming distance can be pictured as the normalised sum of the discrete metric $\rho$ over each of the features. The Hamming distance equals 1 when two vectors have no features in common and 0 when the vectors are identical. We use the Hamming distance because, in this case, the Weisfeiler–Lehman features are indeed categorical, and their values carry no meaning. For continuous node features, on the other hand, we employ the Euclidean distance:
$$d_E(v, v') = \|v - v'\|_2. \quad (9)$$
Next, we substitute the ground distance into the equation of Definition 1 and compute the Wasserstein distance using a network simplex method [31].
Computational complexity. Naively, the computation of the Wasserstein distance has a complexity of $\mathcal{O}(n^3 \log(n))$, with $n$ being the cardinality of the indexed set of node embeddings, i.e., the number of nodes in the two graphs. Nevertheless, efficient speedup tricks can be employed.
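One such trick, entropic regularisation solved by Sinkhorn iterations [8], replaces the exact solver with alternating matrix scalings. Below is a minimal sketch under our own illustrative choices of the regularisation strength `eps` and iteration count; it is not the paper's implementation.

```python
import numpy as np

def sinkhorn_w1(M, eps=0.05, iters=1000):
    """Approximate W1 for a ground-distance matrix M (n x n') with
    uniform marginals, via entropically regularised optimal transport."""
    n, np_ = M.shape
    a, b = np.full(n, 1.0 / n), np.full(np_, 1.0 / np_)
    K = np.exp(-M / eps)            # Gibbs kernel of the ground distances
    u = np.ones(n)
    for _ in range(iters):          # alternate row/column scalings so that
        v = b / (K.T @ u)           # diag(u) K diag(v) has the right marginals
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate transport plan
    return (P * M).sum()             # transported cost <P, M>

M = np.array([[0.5, 1.5],
              [0.5, 0.5]])
print(sinkhorn_w1(M))  # close to the exact W1 of 0.5 for this M
```

Smaller `eps` approaches the exact distance but slows convergence and risks numerical underflow in `K`; log-domain variants address this in practice.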
In particular, approximations relying on Sinkhorn regularisation have been proposed [8], some of which reduce the computational burden to near-linear time while preserving accuracy [1]. Such speedup strategies become extremely useful for larger data sets, i.e., graphs with thousands of nodes, and can be easily integrated into our method. See Appendix A.7 for a practical discussion.

4 From Wasserstein distance to kernels

From the graph Wasserstein distance, one can construct a similarity measure to be used in a learning algorithm. In this section, we propose a new graph kernel, state some claims about its (in)definiteness, and elaborate on how to use it for classifying graphs with continuous and categorical node labels.

Algorithm 1 Compute Wasserstein graph kernel
Input: Two graphs $G_1$, $G_2$; graph embedding scheme $f^H$; ground distance $d$; $\lambda$.
Output: Kernel value $k_{\mathrm{WWL}}(G_1, G_2)$.
  $X_{G_1} \leftarrow f^H(G_1)$; $X_{G_2} \leftarrow f^H(G_2)$  // Generate node embeddings
  $D \leftarrow \mathrm{pairwise\_dist}(X_{G_1}, X_{G_2}, d)$  // Compute the ground distance between each pair of nodes
  $D_W(G_1, G_2) \leftarrow \min_{P \in \Gamma} \langle P, D \rangle$  // Compute the Wasserstein distance
  $k_{\mathrm{WWL}}(G_1, G_2) \leftarrow e^{-\lambda D_W(G_1, G_2)}$

Table 1: Classification accuracies on graphs with categorical node labels.
Comparison of the Weisfeiler–Lehman kernel (WL), the optimal assignment kernel (WL-OA), and our method (WWL).

METHOD | MUTAG       | PTC-MR       | NCI1        | PROTEINS     | D&D         | ENZYMES
V      | 85.39±0.73  | 58.35±0.20   | 64.22±0.11  | 72.12±0.19   | 78.24±0.28  | 22.72±0.56
E      | 84.17±1.44  | 55.82±0.00   | 63.57±0.12  | 72.18±0.42   | 75.49±0.21  | 21.87±0.64
WL     | 85.78±0.83  | 61.21±2.28   | 85.83±0.09  | 74.99±0.28   | 78.29±0.30  | 53.33±0.93
WL-OA  | 87.15±1.82  | 60.58±1.35   | 86.08±0.27  | 76.37±0.30*  | 79.15±0.33  | 58.97±0.82
WWL    | 87.27±1.50  | 66.31±1.21*  | 85.75±0.25  | 74.28±0.56   | 79.69±0.50  | 59.13±0.80

Definition 5 (Wasserstein Weisfeiler–Lehman). Given a set of graphs $\mathcal{G} = \{G_1, \ldots, G_N\}$ and the GWD defined for each pair of graphs on their WL embeddings, we define the Wasserstein Weisfeiler–Lehman (WWL) kernel as
$$K_{\mathrm{WWL}} = e^{-\lambda D_W^{f_{\mathrm{WL}}}}. \quad (10)$$
This is an instance of a Laplacian kernel, which was shown to offer favourable conditions for positive definiteness in the case of non-Euclidean distances [11]. Obtaining the WWL kernel concludes the procedure described in Algorithm 1. In the remainder of this section, we distinguish between the categorical WWL kernel, obtained on graphs with categorical labels, and the continuous WWL kernel, obtained on continuously attributed graphs via the graph embedding schemes described in Section 3.1.
For Euclidean spaces, obtaining positive definite kernels from distance functions is a well-studied topic [17]. However, the Wasserstein distance in its general form is not isometric, i.e., there is no metric-preserving mapping to an $L^2$-norm, as the metric space it induces strongly depends on the chosen ground distance [12].
Therefore, despite being a metric, it is not necessarily possible to derive a positive definite kernel from the Wasserstein distance in its general formulation, because the classical approaches [17] cannot be applied here. Nevertheless, as a consequence of using the Laplacian kernel [11], we can show that, in the setting of categorical node labels, the obtained kernel is positive definite.
Theorem 1. The categorical WWL kernel is positive definite for all $\lambda > 0$.

For a proof, see Sections A.1 and A.1.1 in the Appendix. By contrast, for the continuous case, establishing the definiteness of the obtained kernel remains an open problem. We refer the reader to Section A.1.2 in the supplementary materials for further discussions and conjectures.
Therefore, to ensure the theoretical and practical correctness of our results in the continuous case, we employ recently developed methods for learning with indefinite kernels. Specifically, we use learning methods for Kreĭn spaces, which have been specifically designed to work with indefinite kernels [30]; in general, kernels that are not positive definite induce reproducing kernel Kreĭn spaces (RKKS). These spaces can be seen as a generalisation of reproducing kernel Hilbert spaces, with which they share similar mathematical properties, making them amenable to machine learning techniques. Recent algorithms [26, 29] are capable of solving learning problems in RKKS; their results indicate that there are clear benefits (in terms of classification performance, for example) of learning in such spaces. Therefore, when evaluating WWL, we will use a Kreĭn SVM (KSVM, [26]) as a classifier for the case of continuous attributes.

Table 2: Classification accuracies on graphs with continuous node and/or edge attributes.
Comparison of the hash graph kernels (HGK-WL, HGK-SP), the GraphHopper kernel (GH), and our method (WWL).

METHOD | ENZYMES      | PROTEINS     | IMDB-B       | BZR          | COX2        | BZR-MD      | COX2-MD
VH-C   | 47.15±0.79   | 60.79±0.12   | 71.64±0.49   | 74.82±2.13   | 48.51±0.63  | 66.58±0.97  | 64.89±1.06
RBF-WL | 68.43±1.47   | 75.43±0.28   | 72.06±0.34   | 80.96±1.67   | 75.45±1.53  | 69.13±1.27  | 71.83±1.61
HGK-WL | 63.04±0.65   | 75.93±0.17   | 73.12±0.40   | 78.59±0.63   | 78.13±0.45  | 68.94±0.65  | 74.61±1.74
HGK-SP | 66.36±0.37   | 75.78±0.17   | 73.06±0.27   | 76.42±0.72   | 72.57±1.18  | 66.17±1.05  | 68.52±1.00
GH     | 65.65±0.80   | 74.78±0.29   | 72.35±0.55   | 76.41±1.39   | 69.14±2.08  | 66.20±1.05  | 76.49±0.99
WWL    | 73.25±0.87*  | 77.91±0.80*  | 74.37±0.83*  | 84.42±2.03*  | 78.29±0.47  | 69.76±0.94  | 76.33±1.02

5 Experimental evaluation

In this section, we analyse how the performance of WWL compares with state-of-the-art graph kernels. In particular, we empirically observe that WWL (1) is competitive with the best graph kernels for categorically labelled data, and (2) outperforms all the state-of-the-art graph kernels for attributed graphs.

5.1 Data sets

We report results on real-world data sets from multiple sources [6, 38, 45] and use either their continuous attributes or categorical labels for evaluation. In particular, MUTAG, PTC-MR, NCI1, and D&D are equipped with categorical node labels only; ENZYMES and PROTEINS have both categorical labels and continuous attributes; IMDB-B, BZR, and COX2 only contain continuous attributes; finally, BZR-MD and COX2-MD have both continuous node attributes and edge weights. Further information on the data sets is available in Supplementary Table A.1. Additionally, we report results on synthetic data (SYNTHIE and SYNTHETIC-NEW) in Appendix A.5. All the data sets have been downloaded from Kersting et al.
[20].

5.2 Experimental setup

We compare WWL with state-of-the-art graph kernel methods from the literature and relevant baselines, which we trained ourselves on the same splits (see below). In particular, for the categorical case, we compare with WL [37] and WL-OA [25] as well as with the vertex (V) and edge (E) histograms. Because [25] already showed that WL-OA is superior to previous approaches, we do not include the whole set of kernels in our comparison. For the continuously attributed data sets, we compare with two instances of the hash graph kernel (HGK-SP; HGK-WL) [27] and with the GraphHopper kernel (GH) [10]. For comparison, we additionally use a continuous vertex histogram (VH-C), which is defined as a radial basis function (RBF) kernel between the sums of the graph node embeddings. Furthermore, to highlight the benefits of using the Wasserstein distance in our method, we replace it with an RBF kernel. Specifically, given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, with $|V_1| = n_1$ and $|V_2| = n_2$, we first compute the Gaussian kernel between each pair of the node embeddings obtained in the same fashion as for WWL; we therefore obtain a kernel matrix between node embeddings $K' \in \mathbb{R}^{n_1 \times n_2}$. Next, we sum up its values, $K_s = \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} K'_{i,j}$, and set $K(G_1, G_2) = K_s$. This procedure is repeated for each pair of graphs to obtain the final graph kernel matrix. We refer to this baseline as RBF-WL.
As a classifier, we use an SVM (or a KSVM in the case of WWL) and 10-fold cross-validation, selecting the parameters on the training set only. We repeat each cross-validation split 10 times and report the average accuracy. We employ the same split for each evaluated method, thereby guaranteeing a fully comparable setup among all evaluated methods. Please refer to Appendix A.6
Please refer to Appendix A.6\nfor details on the hyperparameter selection.\n\ni=1Pn2\n\nImplementation and computing infrastructure Available Python implementations can be used\nto compute the WL kernel [41] and the Wasserstein distance [13]. We leverage these resources and\n\n7\n\n\fmake our code publicly available1. We use the original implementations provided by the respective\nauthors to compute the WL-OA, HGK, and GH methods. All our analyses were performed on a\nshared server running Ubuntu 14.04.5 LTS, with 4 CPUs (Intel Xeon E7-4860 v2 @ 2.60GHz) each\nwith 12 cores and 24 threads, and 512 GB of RAM.\n\n5.3 Results and discussion\nThe results are evaluated by classi\ufb01cation accuracy and summarised in Table 1 and Table 2 for the\ncategorical labels and continuous attributes, respectively2.\n\n5.3.1 Categorical labels\nOn the categorical data sets, WWL is comparable to the WL-OA kernel; however, it improves over\nthe classical WL. In particular, WWL largely improves over WL-OA in PTC-MR and is slightly\nbetter on D&D, whereas WL-OA is better on NCI1 and PROTEINS.\nUnsurprisingly, our approach is comparable to the WL-OA, whose main idea is to solve the optimal\nassignment problem by de\ufb01ning Dirac kernels on histograms of node labels, using multiple iterations\nof WL. This formulation is similar to the one we provide for categorical data, but it relies on the\noptimal assignment rather than the optimal transport; therefore, it requires one-to-one mappings\ninstead of continuous transport maps. Besides, we solve the optimal transport problem on the con-\ncatenated embeddings, hereby jointly exploiting representations at multiple WL iterations. Contrarily,\nthe WL-OA performs an optimal assignment at each iteration of WL and only combines them in\nthe second stage. 
However, the key advantage of WWL over WL-OA is its capacity to account for continuous attributes.

5.3.2 Continuous attributes

In this setting, WWL significantly outperforms the other methods on 4 out of 7 data sets, is better on another one, and is on a par on the remaining 2. We further compute the average rank of each method in the continuous setting, with WWL ranking first. The ranks calculated from Table 2 are WWL = 1, HGK-WL = 2.86, RBF-WL = 3.29, HGK-SP = 4.14, and VH-C = 5.86. This is a remarkable improvement over the current state of the art, and it indeed establishes a new one. Therefore, we raise the bar in kernel-based graph classification for attributed graphs. As mentioned in Section 4, the kernel obtained from continuous attributes is not necessarily positive definite. However, we empirically observe the kernel matrices to be positive definite (up to a numerical error), further supporting our theoretical considerations (see Appendix A.1). In practice, the difference between the results obtained from classical SVMs in RKHS and the results obtained with the KSVM approach is negligible.

Comparison with hash graph kernels. The hash graph kernel (HGK) approach is somewhat related to our propagation scheme. By using multiple hashing functions, the HGK method is capable of extending certain existing graph kernels to the continuous setting. This helps to avoid the limitations of perfect hashing, which cannot express small differences in continuous attributes. A drawback of the random hashing performed by HGK is that it requires additional parameters and introduces a stochastic element to the kernel matrix computation. By contrast, our propagation scheme is fully continuous and uses the Wasserstein distance to capture small differences in distributions of continuous node attributes.
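The "positive definite up to a numerical error" check mentioned in Section 5.3.2 can be sketched as follows; this is an illustrative sanity check of our own, not the paper's evaluation code, and any symmetric kernel matrix `K` (e.g. $e^{-\lambda D_W}$ for a pairwise distance matrix) can be tested this way.

```python
import numpy as np

def min_eigenvalue(K):
    """Smallest eigenvalue of a kernel matrix; values not substantially
    below zero indicate positive definiteness up to numerical error."""
    K_sym = 0.5 * (K + K.T)          # symmetrise against round-off noise
    return np.linalg.eigvalsh(K_sym).min()

# A 1-D Laplacian kernel matrix, known to be positive definite.
x = np.array([0.0, 0.7, 1.3, 2.0])
K = np.exp(-np.abs(x[:, None] - x[None, :]))
print(min_eigenvalue(K) > -1e-10)  # True

# A symmetric but indefinite matrix fails the check.
print(min_eigenvalue(np.array([[1.0, 2.0], [2.0, 1.0]])) > -1e-10)  # False
```

For indefinite cases, the same eigendecomposition is the starting point for spectrum-flipping or -clipping corrections, or one switches to a Kreĭn-space learner as done above.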
Moreover, the observed performance gap suggests that an entirely continuous representation of the graphs provides clear benefits over hashing.

1 https://github.com/BorgwardtLab/WWL
2 The best-performing methods, up to the resolution implied by the standard deviation across repetitions, are highlighted in boldface. Additionally, to evaluate significance, we perform 2-sample t-tests with a significance threshold of 0.05 and Bonferroni correction for multiple hypothesis testing within each data set; significantly outperforming methods are denoted by an asterisk.

6 Conclusion

In this paper, we present a new family of graph kernels, the Wasserstein Weisfeiler–Lehman (WWL) graph kernels. Our experiments show that WWL graph kernels outperform the state of the art for graph classification in the scenario of continuous node attributes, while matching the state of the art in the categorical setting. As future work, we see great potential in improving the runtime, which would allow our method to be applied to regimes and data sets with larger graphs. In fact, preliminary experiments (see Section A.7 as well as Figure A.1 in the Appendix) already confirm the benefit of Sinkhorn regularisation when the average number of nodes per graph increases. In parallel, it would be beneficial to derive approximations of the explicit feature representations in the RKKS, as this would also provide a consistent speedup. We further envision that major theoretical contributions could be made by deriving bounds that ensure the positive definiteness of the WWL kernel in the case of continuous node attributes. Finally, optimisation objectives based on optimal transport could be employed to develop new algorithms based on graph neural networks [9, 21].
On a more general level, our proposed method provides a solid foundation for the use of optimal transport theory in kernel methods and highlights the large potential of optimal transport for machine learning.

Acknowledgments

This work was funded in part by the Horizon 2020 project CDS-QUAMRI, Grant No. 634541 (E.G., K.B.), the Alfried Krupp Prize for Young University Teachers of the Alfried Krupp von Bohlen und Halbach-Stiftung (B.R., K.B.), and the SNSF Starting Grant "Significant Pattern Mining" (F.L., K.B.).

References

[1] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Systems 30, pages 1964–1974, 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[3] M.-F. Balcan, A. Blum, and N. Srebro. A theory of learning with similarity functions. Machine Learning, 72(1-2):89–112, 2008.
[4] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups. Springer, Heidelberg, Germany, 1984.
[5] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74–81, 2005.
[6] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21:i47–i56, 2005.
[7] M. R. Bridson and A. Häfliger. Metric spaces of non-positive curvature. Springer, Heidelberg, Germany, 2013.
[8] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
[9] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams.
Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems 28, pages 2224–2232, 2015.
[10] A. Feragen, N. Kasenburg, J. Petersen, M. de Bruijne, and K. Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems 26, pages 216–224, 2013.
[11] A. Feragen, F. Lauze, and S. Hauberg. Geodesic exponential kernels: When curvature and linearity conflict. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3032–3042, 2015.
[12] A. Figalli and C. Villani. Optimal transport and curvature. In Nonlinear PDE's and Applications, pages 171–217. Springer, Heidelberg, Germany, 2011.
[13] R. Flamary and N. Courty. POT: Python Optimal Transport library, 2017. URL https://github.com/rflamary/POT.
[14] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems 28, pages 2053–2061, 2015.
[15] H. Fröhlich, J. K. Wegner, F. Sieker, and A. Zell. Optimal assignment kernels for attributed molecular graphs. In Proceedings of the 22nd International Conference on Machine Learning, pages 225–232, 2005.
[16] A. Gardner, C. A. Duncan, J. Kanno, and R. R. Selmic. On the definiteness of Earth Mover's Distance and its relation to set intersection. IEEE Transactions on Cybernetics, 2017.
[17] B. Haasdonk and C. Bahlmann. Learning with distance substitution kernels. In DAGM-Symposium, 2004.
[18] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California, 1999.
[19] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, pages 321–328, 2003.
[20] K.
Kersting, N. M. Kriege, C. Morris, P. Mutzel, and M. Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.
[21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, 2017.
[22] J. Klicpera, A. Bojchevski, and S. Günnemann. Combining neural networks with personalized pagerank for classification on graphs. In 7th International Conference on Learning Representations, 2019.
[23] S. Kolouri, Y. Zou, and G. K. Rohde. Sliced Wasserstein kernels for probability distributions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5258–5267, 2016.
[24] N. Kriege and P. Mutzel. Subgraph matching kernels for attributed graphs. In Proceedings of the 29th International Conference on Machine Learning, pages 1015–1022, 2012.
[25] N. M. Kriege, P.-L. Giscard, and R. C. Wilson. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems 29, pages 1623–1631, 2016.
[26] G. Loosli, S. Canu, and C. S. Ong. Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1204–1216, 2015.
[27] C. Morris, N. M. Kriege, K. Kersting, and P. Mutzel. Faster kernels for graphs with continuous attributes via hashing. In Proceedings of the 16th IEEE International Conference on Data Mining, pages 1095–1100, 2016.
[28] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.
[29] D. Oglic and T. Gärtner. Learning in reproducing kernel Kreĭn spaces. In Proceedings of the 35th International Conference on Machine Learning, pages 3859–3867, 2018.
[30] C. S. Ong, X. Mary, S.
Canu, and A. J. Smola. Learning with non-positive kernels. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[31] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[32] J. Rabin, G. Peyré, J. Delon, and M. Bernot. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision, pages 435–446, 2011.
[33] B. Rieck, C. Bock, and K. Borgwardt. A persistent Weisfeiler–Lehman procedure for graph classification. In Proceedings of the 36th International Conference on Machine Learning, pages 5448–5458, 2019.
[34] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
[35] B. Schölkopf. The kernel trick for distances. In Advances in Neural Information Processing Systems 13, pages 301–307, 2001.
[36] B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[37] N. Shervashidze and K. M. Borgwardt. Fast subtree kernels on graphs. In Advances in Neural Information Processing Systems 22, pages 1660–1668, 2009.
[38] N. Shervashidze, P. Schweitzer, E. J. v. Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler–Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
[39] O. Shin-Ichi. Barycenters in Alexandrov spaces of curvature bounded below. Advances in Geometry, 14(4):571–587, 2012.
[40] S. S. Stevens. On the theory of scales of measurement. Science, 103(2684):677–680, 1946.
[41] M. Sugiyama, M. E. Ghisu, F. Llinares-López, and K. Borgwardt. graphkernels: R and Python packages for graph comparison.
Bioinformatics, 34(3):530–532, 2018.
[42] K. Turner, Y. Mileyko, S. Mukherjee, and J. Harer. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52:44–70, 2014.
[43] J.-P. Vert. The optimal assignment kernel is not positive definite. arXiv preprint arXiv:0801.4061, 2008.
[44] C. Villani. Optimal transport: old and new, volume 338. Springer, Heidelberg, Germany, 2008.
[45] S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
[46] H. Xu, D. Luo, H. Zha, and L. C. Duke. Gromov–Wasserstein learning for graph matching and node embedding. In Proceedings of the 36th International Conference on Machine Learning, pages 6932–6941, 2019.
[47] P. Yanardag and S. Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.