{"title": "Fast subtree kernels on graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1660, "page_last": 1668, "abstract": "", "full_text": "Fast subtree kernels on graphs\n\nNino Shervashidze, Karsten M. Borgwardt\n\nInterdepartmental Bioinformatics Group\nMax Planck Institutes T\u00a8ubingen, Germany\n\n{nino.shervashidze,karsten.borgwardt}@tuebingen.mpg.de\n\nAbstract\n\nIn this article, we propose fast subtree kernels on graphs. On graphs with n nodes\nand m edges and maximum degree d, these kernels comparing subtrees of height\nh can be computed in O(mh), whereas the classic subtree kernel by Ramon &\nG\u00a8artner scales as O(n24dh). Key to this ef\ufb01ciency is the observation that the\nWeisfeiler-Lehman test of isomorphism from graph theory elegantly computes a\nsubtree kernel as a byproduct. Our fast subtree kernels can deal with labeled\ngraphs, scale up easily to large graphs and outperform state-of-the-art graph ker-\nnels on several classi\ufb01cation benchmark datasets in terms of accuracy and runtime.\n\n1 Introduction\n\nGraph kernels have recently evolved into a branch of kernel machines that reaches deep into graph\nmining. Several different graph kernels have been de\ufb01ned in machine learning which can be catego-\nrized into three classes: graph kernels based on walks [5, 7] and paths [2], graph kernels based on\nlimited-size subgraphs [6, 11], and graph kernels based on subtree patterns [9, 10].\nWhile fast computation techniques have been developed for graph kernels based on walks [12]\nand on limited-size subgraphs [11], it is unclear how to compute subtree kernels ef\ufb01ciently. As a\nconsequence, they have been applied to relatively small graphs representing chemical compounds [9]\nor handwritten digits [1], with approximately twenty nodes on average. 
But could one speed up subtree kernels to make them usable on graphs with hundreds of nodes, as they arise in protein structure models or in program flow graphs?
It is a general limitation of graph kernels that they scale poorly to large, labeled graphs with more than 100 nodes. While the efficient kernel computation strategies from [11, 12] are able to compare unlabeled graphs efficiently, the efficient comparison of large, labeled graphs remains an unsolved challenge. Could one speed up subtree kernels to make them the kernel of choice for comparing large, labeled graphs?
The goal of this article is to address both of the aforementioned questions, that is, to develop a fast subtree kernel that scales up to large, labeled graphs.
The remainder of this article is structured as follows. In Section 2, we review the subtree kernel from the literature and its runtime complexity. In Section 3, we describe an alternative subtree kernel and its efficient computation based on the Weisfeiler-Lehman test of isomorphism. In Section 4, we compare these two subtree kernels to each other, as well as to a set of four other state-of-the-art graph kernels, and report results on kernel computation runtime and classification accuracy on graph benchmark datasets.

2 The Ramon-Gärtner subtree kernel

Terminology  We define a graph G as a triplet (V, E, L), where V is the set of vertices, E the set of undirected edges, and L : V → Σ a function that assigns labels from an alphabet Σ to nodes in the graph¹. The neighbourhood N(v) of a node v is the set of nodes to which v is connected by an edge, that is, N(v) = {v' | (v, v') ∈ E}. For simplicity, we assume that every graph has n nodes, m edges and a maximum degree of d, and that there are N graphs in our given set of graphs.
A walk is a sequence of nodes in a graph, in which consecutive nodes are connected by an edge.
A path is a walk that consists of distinct nodes only. A (rooted) subtree is a subgraph of a graph which has no cycles, but a designated root node. A subtree of G can thus be seen as a connected subset of distinct nodes of G with an underlying tree structure. The height of a subtree is the maximum distance between the root and any other node in the subtree, plus one. The notion of walk extends the notion of path by allowing nodes to be equal. Similarly, the notion of subtrees can be extended to subtree patterns (also called 'tree-walks' [1]), which can have nodes that are equal. These repetitions of the same node are then treated as distinct nodes, such that the pattern is still a cycle-free tree. Note that all subtree kernels compare subtree patterns in two graphs, not (strict) subtrees. Let S(G) refer to the set of all subtree patterns in graph G.

Definition  The first subtree kernel on graphs was defined by [10]. It compares all pairs of nodes from graphs G = (V, E, L) and G' = (V', E', L') by iteratively comparing their neighbourhoods:

k_Ramon^(h)(G, G') = ∑_{v ∈ V} ∑_{v' ∈ V'} k_h(v, v'),    (1)

where

k_h(v, v') = { δ(L(v), L'(v')),                                    if h = 1
             { λ_r λ_s ∑_{R ∈ M(v,v')} ∏_{(w,w') ∈ R} k_{h-1}(w, w'),  if h > 1    (2)

and

M(v, v') = {R ⊆ N(v) × N(v') | (∀(u, u'), (w, w') ∈ R : u = w ⇔ u' = w') ∧ (∀(u, u') ∈ R : L(u) = L'(u'))}.    (3)

Intuitively, k_Ramon iteratively compares all matchings M(v, v') between neighbours of two nodes v from G and v' from G'.

Complexity  The runtime complexity of the subtree kernel for a pair of graphs is O(n^2 h 4^d), including a comparison of all pairs of nodes (n^2) and a pairwise comparison of all matchings in their neighbourhoods in O(4^d), which is repeated in h iterations. Here h is a multiplicative factor, not an exponent, as one can implement the subtree kernel recursively, starting with k_1 and computing k_h from k_{h-1}. For a dataset of N graphs, the resulting runtime complexity is then in O(N^2 n^2 h 4^d).

Related work  The subtree kernels in [9] and [1] refine the above definition for applications in chemoinformatics and hand-written digit recognition. Mahé and Vert [9] define extensions of the classic subtree kernel that avoid tottering [8] and consider unbalanced subtrees. Both [9] and [1] propose to consider α-ary subtrees with at most α children per node. This restricts the set of matchings to matchings of up to α nodes, but the runtime complexity is still exponential in this parameter α, which both papers describe as feasible on small graphs (with approximately 20 nodes) with many distinct node labels.
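To make the cost of this matching-based recursion concrete, the definition in (1)-(3) can be sketched directly in code. This is an illustrative, naive implementation, not the authors' code: it sets λ_r = λ_s = 1 (as in the experiments of Section 4), restricts M(v, v') to non-empty matchings, and represents a graph as an adjacency list plus a node-label list:

```python
import math
from itertools import combinations, permutations

def matchings(nv, nvp, lab, labp):
    # All non-empty, label-preserving partial bijections R between the
    # neighbour lists nv and nvp: the set M(v, v') of equation (3).
    for k in range(1, min(len(nv), len(nvp)) + 1):
        for left in combinations(nv, k):
            for right in permutations(nvp, k):
                if all(lab[a] == labp[b] for a, b in zip(left, right)):
                    yield tuple(zip(left, right))

def k_h(adj, adjp, lab, labp, v, vp, h):
    # Node-level kernel k_h(v, v') of equation (2), with lambda_r = lambda_s = 1.
    if h == 1:
        return 1.0 if lab[v] == labp[vp] else 0.0
    return sum(math.prod(k_h(adj, adjp, lab, labp, w, wp, h - 1)
                         for w, wp in R)
               for R in matchings(adj[v], adjp[vp], lab, labp))

def k_ramon(adj, adjp, lab, labp, h):
    # Graph-level kernel of equation (1): sum over all pairs of nodes.
    return sum(k_h(adj, adjp, lab, labp, v, vp, h)
               for v in range(len(adj)) for vp in range(len(adjp)))
```

The nested enumeration over neighbour subsets and their bijections is exactly what drives the exponential per-node-pair cost discussed below; on a labeled path a-b compared with itself, this sketch yields k^(1) = k^(2) = 2.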
We next present a subtree kernel that is efficient to compute on graphs with hundreds and thousands of nodes.

¹The extension of this definition and our results to graphs with edge labels is straightforward, but omitted for clarity of presentation.

3 Fast subtree kernels

3.1 The Weisfeiler-Lehman test of isomorphism

Our algorithm for computing a fast subtree kernel builds upon the Weisfeiler-Lehman test of isomorphism [14], more specifically its 1-dimensional variant, also known as "naive vertex refinement", which we describe in the following.
Assume we are given two graphs G and G' and we would like to test whether they are isomorphic. The 1-dimensional Weisfeiler-Lehman test proceeds in iterations, which we index by h and which comprise the steps given in Algorithm 1.

Algorithm 1 One iteration of the 1-dimensional Weisfeiler-Lehman test of graph isomorphism
1: Multiset-label determination
• For h = 1, set M_h(v) := l_0(v) = L(v) for labeled graphs, and M_h(v) := l_0(v) = |N(v)| for unlabeled graphs.
• For h > 1, assign a multiset-label M_h(v) to each node v in G and G' which consists of the multiset {l_{h-1}(u) | u ∈ N(v)}.
2: Sorting each multiset
• Sort elements in M_h(v) in ascending order and concatenate them into a string s_h(v).
• Add l_{h-1}(v) as a prefix to s_h(v).
3: Sorting the set of multisets
• Sort all of the strings s_h(v) for all v from G and G' in ascending order.
4: Label compression
• Map each string s_h(v) to a new compressed label, using a function f : Σ* → Σ such that f(s_h(v)) = f(s_h(w)) if and only if s_h(v) = s_h(w).
5: Relabeling
• Set l_h(v) := f(s_h(v)) for all nodes in G and G'.

The sorting step 3 allows for a straightforward definition and implementation of f for the compression step 4: one keeps a counter variable for f that records the number of distinct strings that f has compressed before. When f encounters a new string, the counter is incremented by one and f assigns the counter's value to that string; if an identical string has been compressed before, f assigns it the same value as before. The sorted order from step 3 guarantees that all identical strings are mapped to the same number, because they occur in a consecutive block.
The Weisfeiler-Lehman algorithm terminates after step 5 of iteration h if {l_h(v) | v ∈ V} ≠ {l_h(v') | v' ∈ V'}, that is, if the sets of newly created labels are not identical in G and G'. The graphs are then not isomorphic. If the sets are identical after n iterations, the algorithm stops without giving an answer.

Complexity  The runtime complexity of the Weisfeiler-Lehman algorithm with h iterations is O(hm). Defining the multisets in step 1 for all nodes is an O(m) operation. Sorting each multiset is an O(m) operation for all nodes. This efficiency can be achieved by using Counting Sort, an instance of Bucket Sort, due to the limited range that the elements of the multisets come from: the elements of each multiset are a subset of {f(s_h(v)) | v ∈ V}, and for a fixed h, the cardinality of this set is upper-bounded by n. This means that we can sort all multisets in O(m) by the following procedure: we assign the elements of all multisets to their corresponding buckets, recording which multiset they came from. By reading through all buckets in ascending order, we can then extract the sorted multisets for all nodes in a graph. The runtime is O(m), as there are O(m) elements in the multisets of a graph in iteration h. Sorting the resulting strings is of time complexity O(m) via Radix Sort. The label compression requires one pass over all strings and their characters, that is, O(m).
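The relabeling scheme of steps 1-5 can be sketched compactly. This is an illustrative sketch, not the authors' implementation: a dictionary stands in for the counter-based compression function f, so the explicit sorting of step 3 and the linear-time Counting Sort guarantee are dropped, and labels from different iterations are not kept disjoint:

```python
def wl_relabel(adj, labels, table):
    # One Weisfeiler-Lehman iteration: build s_h(v) as the prefix
    # l_{h-1}(v) plus the sorted multiset of neighbour labels, then
    # compress it to a small integer via the shared `table`.
    new_labels = []
    for v, neighbours in enumerate(adj):
        s = (labels[v], tuple(sorted(labels[u] for u in neighbours)))
        new_labels.append(table.setdefault(s, len(table)))
    return new_labels

# A triangle and a 3-node path, both uniformly labeled 0, are separated
# after a single iteration, since node degrees enter via the multiset sizes.
table = {}
triangle = wl_relabel([[1, 2], [0, 2], [0, 1]], [0, 0, 0], table)  # [0, 0, 0]
path = wl_relabel([[1], [0, 2], [1]], [0, 0, 0], table)            # [1, 0, 1]
```

Sharing `table` across both graphs is what guarantees that identical strings receive identical compressed labels, making the per-iteration label sets of the two graphs directly comparable.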
Hence all these steps result in a total runtime of O(hm) for h iterations.

3.2 The Weisfeiler-Lehman kernel on pairs of graphs

Based on the Weisfeiler-Lehman algorithm, we define the following kernel function.

Definition 1 The Weisfeiler-Lehman kernel on two graphs G and G' is defined as:

k_WL^(h)(G, G') = |{(s_i(v), s_i(v')) | f(s_i(v)) = f(s_i(v')), i ∈ {1, . . . , h}, v ∈ V, v' ∈ V'}|,    (4)

where f is injective and the sets {f(s_i(v)) | v ∈ V ∪ V'} and {f(s_j(v)) | v ∈ V ∪ V'} are disjoint for all i ≠ j.

That is, the Weisfeiler-Lehman kernel counts common multiset strings in two graphs.

Theorem 2 The Weisfeiler-Lehman kernel is positive definite.

Proof Intuitively, k_WL^(h) is a kernel because it counts matching subtree patterns of up to height h in two graphs. More formally, let us define a mapping φ that counts the occurrences of a particular label sequence s in G (generated in h iterations of Weisfeiler-Lehman). Let φ_s^(h)(G) denote the number of occurrences of s in G, and analogously φ_s^(h)(G') for G'. Then

k_s^(h)(G, G') = φ_s^(h)(G) φ_s^(h)(G') = |{(s_i(v), s_i(v')) | s_i(v) = s_i(v') = s, i ∈ {1, . . . , h}, v ∈ V, v' ∈ V'}|,    (5)

and if we sum over all s from Σ*, we obtain

k_WL^(h)(G, G') = ∑_{s ∈ Σ*} k_s^(h)(G, G') = ∑_{s ∈ Σ*} φ_s^(h)(G) φ_s^(h)(G') = |{(s_i(v), s_i(v')) | s_i(v) = s_i(v'), i ∈ {1, . . . , h}, v ∈ V, v' ∈ V'}|    (6)
= |{(s_i(v), s_i(v')) | f(s_i(v)) = f(s_i(v')), i ∈ {1, . . . , h}, v ∈ V, v' ∈ V'}|,    (7)

where the last equality follows from the fact that f is injective.
As f is injective, each string s, and hence each compressed label, corresponds to exactly one subtree pattern t, so k_WL^(h) defines a kernel with corresponding feature map φ_WL^(h), such that φ_WL^(h)(G) = (φ_s^(h)(G))_{s ∈ Σ*} = (φ_t^(h)(G))_{t ∈ S(G)}.

Theorem 3 The Weisfeiler-Lehman kernel on a pair of graphs G and G' can be computed in O(hm).

Proof This follows directly from the definition of the Weisfeiler-Lehman kernel and the runtime complexity of the Weisfeiler-Lehman test, as described in Section 3.1. The number of matching multiset strings can be counted as part of step 3, as they occur consecutively in the sorted order.

3.3 The Weisfeiler-Lehman kernel on N graphs

For computing the Weisfeiler-Lehman kernel on N graphs, we propose the following algorithm, which improves over the naive, N^2-fold application of the kernel from (4). We now process all N graphs simultaneously and conduct the steps given in Algorithm 2 in each of h iterations on each graph G.
The hash function g can be implemented efficiently: it again keeps a counter variable x which counts the number of distinct strings that g has mapped to compressed labels so far. If g is applied to a string that is different from all previous ones, then the string is mapped to x + 1, and x is incremented. As before, g is required to keep the sets of compressed labels from different iterations disjoint.

Theorem 4 For N graphs, the Weisfeiler-Lehman kernel on all pairs of these graphs can be computed in O(Nhm + N^2 hn).

Proof Naive application of the kernel from definition (4) for computing an N × N kernel matrix would require a runtime of O(N^2 hm). One can improve upon this runtime complexity by computing φ_WL^(h) explicitly.
This can be achieved by replacing the compression mapping f in the classic Weisfeiler-Lehman algorithm by a hash function g that is applied to all N graphs simultaneously.

Algorithm 2 One iteration of the Weisfeiler-Lehman kernel on N graphs
1: Multiset-label determination
• Assign a multiset-label M_h(v) to each node v in G which consists of the multiset {l_{h-1}(u) | u ∈ N(v)}.
2: Sorting each multiset
• Sort elements in M_h(v) in ascending order and concatenate them into a string s_h(v).
• Add l_{h-1}(v) as a prefix to s_h(v).
3: Label compression
• Map each string s_h(v) to a compressed label using a hash function g : Σ* → Σ such that g(s_h(v)) = g(s_h(w)) if and only if s_h(v) = s_h(w).
4: Relabeling
• Set l_h(v) := g(s_h(v)) for all nodes in G.

This has the following effects on the runtime of Weisfeiler-Lehman: Step 1, the multiset-label determination, still requires O(Nm). Step 2, the sorting of the elements in each multiset, can be done via a joint Bucket Sort (Counting Sort) of all strings, requiring O(Nn + Nm) time. The use of the hash function g renders the sorting of all strings unnecessary (step 3 from Section 3.1), as identical strings will be mapped to the same compressed label anyway. Steps 4 and 5 of Section 3.1 (label compression and relabeling) remain unchanged. The effort of computing φ_WL^(h) on all N graphs in h iterations is then O(Nhm), assuming that m > n.
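The two stages of this computation, explicit feature maps built jointly over all N graphs followed by pairwise dot products, can be sketched as follows. This is a simplified sketch, not the authors' implementation: a per-iteration dictionary stands in for the hash function g, and Counters play the role of sparse feature vectors:

```python
from collections import Counter

def wl_feature_maps(graphs, node_labels, h):
    # graphs: list of adjacency lists; node_labels: list of label lists.
    # Returns one sparse vector per graph, counting every (iteration, label)
    # pair: the explicit feature map phi of the Weisfeiler-Lehman kernel.
    phi = [Counter() for _ in graphs]
    labels = [list(l) for l in node_labels]
    for it in range(h):
        table = {}  # shared across all graphs: plays the role of g
        for gi, adj in enumerate(graphs):
            new = [table.setdefault(
                       (labels[gi][v],
                        tuple(sorted(labels[gi][u] for u in adj[v]))),
                       len(table))
                   for v in range(len(adj))]
            labels[gi] = new
            phi[gi].update((it, l) for l in new)
    return phi

def wl_kernel_matrix(phi):
    # Pairwise sparse dot products of the feature vectors.
    return [[sum(a[k] * b[k] for k in a if k in b) for b in phi] for a in phi]

# Example: a triangle vs. a 3-node path, uniform labels, h = 1.
phi = wl_feature_maps([[[1, 2], [0, 2], [0, 1]], [[1], [0, 2], [1]]],
                      [[0, 0, 0], [0, 0, 0]], h=1)
K = wl_kernel_matrix(phi)
```

Keying the vectors by (iteration, label) pairs keeps the compressed labels of different iterations disjoint, as the kernel definition requires.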
To get all pairwise kernel values, we have to multiply all feature vectors, which requires a runtime of O(N^2 hn), as each graph G has at most hn non-zero entries in φ_WL^(h)(G).

3.4 Link to the Ramon-Gärtner kernel

The Weisfeiler-Lehman kernel can be defined in a recursive fashion which elucidates its relation to the Ramon-Gärtner kernel.

Theorem 5 The kernel k_recursive^(h), recursively defined as

k_recursive^(h)(G, G') = ∑_{i=1}^{h} ∑_{v ∈ V} ∑_{v' ∈ V'} k_i(v, v'),    (8)

where

k_i(v, v') = { δ(L(v), L'(v')),                                                  if i = 1
             { k_{i-1}(v, v') max_{R ∈ M(v,v')} ∏_{(w,w') ∈ R} k_{i-1}(w, w'),   if i > 1 and M(v, v') ≠ ∅
             { 0,                                                                if i > 1 and M(v, v') = ∅    (9)

and

M(v, v') = {R ⊆ N(v) × N(v') | (∀(u, u'), (w, w') ∈ R : u = w ⇔ u' = w') ∧ (∀(u, u') ∈ R : L(u) = L'(u')) ∧ |R| = |N(v)| = |N(v')|},    (10)

is equivalent to the Weisfeiler-Lehman kernel k_WL^(h).

Proof We prove this theorem by induction over h. Induction initialisation, h = 1:

k_WL^(1) = |{(s_1(v), s_1(v')) | f(s_1(v)) = f(s_1(v')), v ∈ V, v' ∈ V'}|    (11)
= ∑_{v ∈ V} ∑_{v' ∈ V'} δ(L(v), L'(v')) = k_recursive^(1).    (12)

The equality follows from the definition of M(v, v').
Induction step h → h + 1: Assume that k_WL^(h) = k_recursive^(h). Then

k_recursive^(h+1) = ∑_{v ∈ V} ∑_{v' ∈ V'} k_{h+1}(v, v') + ∑_{i=1}^{h} ∑_{v ∈ V} ∑_{v' ∈ V'} k_i(v, v')    (13)
= |{(s_{h+1}(v), s_{h+1}(v')) | f(s_{h+1}(v)) = f(s_{h+1}(v')), v ∈ V, v' ∈ V'}| + k_WL^(h) = k_WL^(h+1),    (14)

where the equality of (13) and (14) follows from the fact that k_{h+1}(v, v') = 1 if and only if the neighbourhoods of v and v' are identical, that is, if f(s_{h+1}(v)) = f(s_{h+1}(v')).

Figure 1: Runtime in seconds for kernel matrix computation on synthetic graphs using the pairwise (red, dashed) and the global (green) Weisfeiler-Lehman kernel (default values: dataset size N = 10, graph size n = 100, subtree height h = 5, graph density c = 0.4).

Theorem 5 highlights the following differences between the Weisfeiler-Lehman and the Ramon-Gärtner kernel: in (8), Weisfeiler-Lehman considers all subtrees up to height h, whereas the Ramon-Gärtner kernel considers the subtrees of exactly height h. In (9) and (10), the Weisfeiler-Lehman kernel checks whether the neighbourhoods of v and v' match exactly, whereas the Ramon-Gärtner kernel considers all pairs of matching subsets of the neighbourhoods of v and v' in (3). In our experiments, we next examine the empirical differences between these two kernels in terms of runtime and prediction accuracy on classification benchmark datasets.

4 Experiments

4.1 Runtime behaviour of Weisfeiler-Lehman kernel

Methods We empirically compared the runtime behaviour of our two variants of the Weisfeiler-Lehman (WL) kernel. The first variant computes kernel values pairwise in O(N^2 hm). The second variant computes the kernel values in O(Nhm + N^2 hn) on the whole dataset simultaneously.
We will refer to the former variant as the 'pairwise' WL, and to the latter as the 'global' WL.

Experimental setup We assessed the behaviour on randomly generated graphs with respect to four parameters: dataset size N, graph size n, subtree height h and graph density c. The density of an undirected graph of n nodes without self-loops is defined as the number of its edges divided by n(n − 1)/2, the maximal number of edges. We kept 3 out of 4 parameters fixed at their default values and varied the fourth parameter. The default values we used were 10 for N, 100 for n, 5 for h and 0.4 for the graph density c. In more detail, we varied N and n in the range {10, 100, 1000}, h in {2, 4, 8} and c in {0.1, 0.2, . . . , 0.9}.
For each individual experiment, we generated N graphs with n nodes, and inserted edges randomly until the number of edges reached ⌊cn(n − 1)/2⌋. We then computed the pairwise and the global WL kernel on these synthetic graphs. We report CPU runtimes in seconds in Figure 1, as measured in Matlab R2008a on an Apple MacPro with a 3.0GHz Intel 8-Core processor and 16GB RAM.

Results Empirically, we observe that the pairwise kernel scales quadratically with dataset size N. Interestingly, the global kernel scales linearly with N: the N^2 sparse vector multiplications that have to be performed for kernel computation with the global WL do not dominate runtime here.
This result on synthetic data indicates that the global WL kernel has attractive scalability properties for large datasets.
When varying the number of nodes n per graph, we observe that the runtime of the global WL scales linearly with n, and is much faster than the pairwise WL for large graphs.
We observe the same picture for the height h of the subtree patterns: the runtime of both kernels grows linearly with h, but the global WL is more efficient in terms of runtime in seconds.
Varying the graph density c, both methods again show a linearly increasing runtime, although the runtime of the global WL kernel is close to constant. The density c seems to be a graph property that affects the runtime of the pairwise kernel more severely than that of the global WL.
Across all different graph properties, the global WL kernel from Section 3.3 requires less runtime than the pairwise WL kernel from Section 3.2. Hence the global WL kernel is the variant of our Weisfeiler-Lehman kernel that we use in the following graph classification tasks.

4.2 Graph classification

Datasets We employed the following datasets in our experiments: MUTAG, NCI1, NCI109 and D&D. MUTAG [3] is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds, labeled according to whether or not they have a mutagenic effect on the Gram-negative bacterium Salmonella typhimurium. We also conducted experiments on two balanced subsets of NCI1 and NCI109, which classify compounds based on whether or not they are active in an anti-cancer screen ([13] and http://pubchem.ncbi.nlm.nih.gov). D&D is a dataset of 1178 protein structures [4]. Each protein is represented by a graph, in which the nodes are amino acids and two nodes are connected by an edge if they are less than 6 Angstroms apart.
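This contact-map construction can be illustrated as follows (an illustrative sketch with made-up coordinates; in practice, residue positions would come from a PDB structure):

```python
import math

def contact_graph(coords, threshold=6.0):
    # One node per amino acid; an undirected edge between two residues
    # whenever their coordinates are closer than `threshold` Angstroms.
    edges = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                edges.add((i, j))
    return edges

# Three residues on a line, 5 and 7 Angstroms apart: only the first
# pair falls within the 6 Angstrom cutoff.
edges = contact_graph([(0, 0, 0), (0, 0, 5.0), (0, 0, 12.0)])  # {(0, 1)}
```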
The prediction task is to classify the protein structures into enzymes and non-enzymes.

Experimental setup On these datasets, we compared our Weisfeiler-Lehman kernel to the Ramon-Gärtner kernel (λ_r = λ_s = 1), as well as to several state-of-the-art graph kernels for large graphs: the fast geometric random walk kernel from [12] that counts common labeled walks (with λ chosen from the set {10^-2, 10^-3, . . . , 10^-6} by cross-validation on the training set), the graphlet kernel from [11] that counts common induced labeled connected subgraphs of size 3, and the shortest path kernel from [2] that counts pairs of labeled nodes with identical shortest path distance.
We performed 10-fold cross-validation of C-Support Vector Machine classification, using 9 folds for training and 1 for testing. All parameters of the SVM were optimised on the training dataset only. To exclude random effects of fold assignments, we repeated the whole experiment 10 times. We report average prediction accuracies and standard errors in Table 1, and kernel computation runtimes in Table 2.
We chose h for our Weisfeiler-Lehman kernel by cross-validation on the training dataset for h ∈ {1, . . . , 10}, which means that we computed 10 different WL kernel matrices in each experiment. We report the total runtime of this computation (not the average per kernel matrix).

Results In terms of runtime, the Weisfeiler-Lehman kernel easily scales up even to graphs with thousands of nodes. On D&D, subtree patterns of height up to 10 were computed in 11 minutes, while no other comparison method could handle this dataset in less than half an hour. The shortest path kernel is competitive with the WL kernel on the smaller graphs (MUTAG, NCI1, NCI109), but on D&D its runtime degenerates to more than 23 hours.
The Ramon and Gärtner kernel was computable on MUTAG in approximately 40 minutes, but for the large NCI datasets it only finished computation on a subsample of 100 graphs within two days; on D&D, it did not even finish on a subsample of 100 graphs within two days. The random walk kernel is competitive on MUTAG but, like the Ramon-Gärtner kernel, does not finish computation on the full NCI datasets and on D&D within two days. The graphlet kernel is faster than our WL kernel on MUTAG and the NCI datasets, and about a factor of 3 slower on D&D.

Method / Dataset     MUTAG           NCI1            NCI109          D&D
Weisfeiler-Lehman    82.05 (±0.36)   82.19 (±0.18)   82.46 (±0.24)   79.78 (±0.36)
Ramon & Gärtner      85.72 (±0.49)   —               —               —
Graphlet count       75.61 (±0.49)   66.00 (±0.07)   66.59 (±0.08)   78.59 (±0.12)
Random walk          80.72 (±0.38)   —               —               —
Shortest path        87.28 (±0.55)   73.47 (±0.11)   73.07 (±0.11)   78.45 (±0.26)

Table 1: Prediction accuracy (± standard error) on graph classification benchmark datasets. —: did not finish in 2 days.

Dataset              MUTAG     NCI1                   NCI109                  D&D
Maximum # nodes      28        111                    111                     5748
Average # nodes      17.93     29.87                  29.68                   284.32
# labels             7         37                     54                      89
Number of graphs     188       100 / 4110             100 / 4127              100 / 1178
Weisfeiler-Lehman    6”        5” / 7’20”             5” / 7’21”              58” / 11’
Ramon & Gärtner      40’6”     25’9” / 29 days*       26’40” / 31 days*       — / —
Graphlet count       3”        2” / 1’27”             2” / 1’27”              2’40” / 30’21”
Random walk          12”       58’30” / 68 days*      2h 9’41” / 153 days*    — / —
Shortest path        2”        3” / 4’38”             3” / 4’39”              58’45” / 23h 17’2”

Table 2: CPU runtime for kernel computation on graph classification benchmark datasets. —: did not finish in 2 days; *: extrapolated. For NCI1, NCI109 and D&D, runtimes are given both on a subsample of 100 graphs and on the full dataset.

However, this efficiency comes at a price, as the kernel based on size-3 graphlets turns out to lead to poor accuracy levels on three datasets. Using larger graphlets with 4 or 5 nodes, which might have been more expressive, led to infeasible runtime requirements in initial experiments (not shown here).
On NCI1, NCI109 and D&D, the Weisfeiler-Lehman kernel reached the highest accuracy. On D&D the shortest path and graphlet kernels yielded similarly good results, while on NCI1 and NCI109 the Weisfeiler-Lehman kernel improves by more than 8% over the best accuracy attained by the other methods. On MUTAG, it reaches the third best accuracy among all methods considered. We could not assess the performance of the Ramon & Gärtner kernel and the random walk kernel on the larger datasets, as their computation did not finish in 48 hours. The labeled size-3 graphlet kernel achieves low accuracy levels, except on D&D.
To summarize, the WL kernel turns out to be competitive in terms of runtime on all smaller datasets, fastest on the large protein dataset, and its accuracy levels are highest on three out of four datasets.

5 Conclusions

We have defined a fast subtree kernel on graphs that combines scalability with the ability to deal with node labels.
It is competitive with state-of-the-art kernels on several classification benchmark datasets in terms of accuracy, even reaching the highest accuracy level on three out of four datasets, and outperforms them significantly in terms of runtime on large graphs, even the efficient computation schemes for random walk kernels [12] and graphlet kernels [11] that were recently defined.
This new kernel opens the door to applications of graph kernels on large graphs in bioinformatics, for instance protein function prediction via detailed graph models of protein structure on the amino acid level, or phenotype prediction on gene networks. An exciting algorithmic question for further studies will be to consider kernels on graphs with continuous or high-dimensional node labels and their efficient computation.

Acknowledgements

The authors would like to thank Kurt Mehlhorn, Pascal Schweitzer, and Erik Jan van Leeuwen for fruitful discussions.

References
[1] F. R. Bach. Graph kernels between point clouds. In ICML, pages 25–32, 2008.
[2] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Proc. Intl. Conf. Data Mining, pages 74–81, 2005.
[3] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem, 34:786–797, 1991.
[4] P. D. Dobson and A. J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. J Mol Biol, 330(4):771–783, Jul 2003.
[5] T. Gärtner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In B. Schölkopf and M. Warmuth, editors, Sixteenth Annual Conference on Computational Learning Theory and Seventh Kernel Workshop, COLT. Springer, 2003.
[6] T. Horvath, T. Gärtner, and S. Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2004.
[7] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning (ICML), Washington, DC, United States, 2003.
[8] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph kernels. In Proceedings of the Twenty-First International Conference on Machine Learning, 2004.
[9] P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. q-bio/0609024, September 2006.
[10] J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. Technical report, First International Workshop on Mining Graphs, Trees and Sequences (held with ECML/PKDD'03), 2003.
[11] N. Shervashidze, S. V. N. Vishwanathan, T. Petri, K. Mehlhorn, and K. M. Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, 2009.
[12] S. V. N. Vishwanathan, K. M. Borgwardt, and N. N. Schraudolph. Fast computation of graph kernels. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.
[13] N. Wale and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proc. of ICDM, pages 678–689, Hong Kong, 2006.
[14] B. Weisfeiler and A. A. Lehman. A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia, Ser. 2, 9, 1968.
", "award": [], "sourceid": 3813, "authors": [{"given_name": "Nino", "family_name": "Shervashidze", "institution": null}, {"given_name": "Karsten", "family_name": "Borgwardt", "institution": ""}]}