{"title": "Scalable kernels for graphs with continuous attributes", "book": "Advances in Neural Information Processing Systems", "page_first": 216, "page_last": 224, "abstract": "While graphs with continuous node attributes arise in many applications, state-of-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity; for instance, the popular shortest path kernel scales as $\\mathcal{O}(n^4)$, where $n$ is the number of nodes. In this paper, we present a class of path kernels with computational complexity $\\mathcal{O}(n^2 (m + \\delta^2))$, where $\\delta$ is the graph diameter and $m$ the number of edges. Due to the sparsity and small diameter of real-world graphs, these kernels scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets.", "full_text": "Scalable kernels for graphs with continuous attributes\n\nAasa Feragen, Niklas Kasenburg\n\nMachine Learning and Computational Biology Group\n\nMax Planck Institutes T\u00a8ubingen and DIKU, University of Copenhagen\n\n{aasa,niklas.kasenburg}@diku.dk\n\nJens Petersen1,\n\nMarleen de Bruijne1,2\n\n1DIKU, University of Copenhagen\n2 Erasmus Medical Center Rotterdam\n{phup,marleen}@diku.dk\n\nKarsten Borgwardt\n\nMachine Learning and Computational Biology Group\n\nMax Planck Institutes T\u00a8ubingen\n\nEberhard Karls Universit\u00a8at T\u00a8ubingen\n\nkarsten.borgwardt@tuebingen.mpg.de\n\nAbstract\n\nWhile graphs with continuous node attributes arise in many applications, state-\nof-the-art graph kernels for comparing continuous-attributed graphs suffer from\na high runtime complexity. For instance, the popular shortest path kernel scales\nas O(n4), where n is the number of nodes. 
In this paper, we present a class of graph kernels with computational complexity O(n^2(m + log n + δ^2 + d)), where δ is the graph diameter, m is the number of edges, and d is the dimension of the node attributes. Due to the sparsity and small diameter of real-world graphs, these kernels typically scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets.\n\n1 Introduction\n\nGraph-structured data appears in many application domains of machine learning, reaching from Social Network Analysis to Computational Biology. Comparing graphs to each other is a fundamental problem in learning on graphs, and graph kernels have become an efficient and widely-used method for measuring similarity between graphs. Highly scalable graph kernels have been proposed for graphs with thousands and millions of nodes, both for graphs without node labels [1] and for graphs with discrete node labels [2]. Such graphs appear naturally in applications such as natural language processing, chemoinformatics and bioinformatics. For applications in medical image analysis, computer vision or even bioinformatics, however, continuous-valued physical measurements such as shape, relative position or other measured node properties are often important features for classification. An open challenge, which is receiving increased attention, is to develop a scalable kernel on graphs with continuous-valued node attributes.\n\nWe present the GraphHopper kernel between graphs with real-valued edge lengths and any type of node attribute, including vectors. This kernel is a convolution kernel counting sub-path similarities. The computational complexity of this kernel is O(n^2(m + log n + δ^2 + d)), where n and m are the number of nodes and edges, respectively; δ is the graph diameter; and d is the dimension of the node attributes. 
Although δ = n or m = n^2 in the worst case, this is rarely the case in real-world graphs, as is also illustrated by our experiments. We find empirically in Section 3.1 that our GraphHopper kernel tends to scale quadratically with the number of nodes on real data.\n\n1.1 Related work\n\nMany popular kernels for structured data are sums of substructure kernels:\n\nk(G, G′) = Σ_{s∈S} Σ_{s′∈S′} k_sub(s, s′).\n\nHere G and G′ are structured data objects such as strings, trees and graphs with classes S and S′ of substructures, and k_sub is a substructure kernel. Such k are instances of R-convolution kernels [3]. A large variety of kernels exist for structures such as strings [4, 5], finite state transducers [6] and trees [5, 7]. For graphs in general, kernels can be sorted into categories based on the types of attributes they can handle. The graphlet kernel [1] compares unlabeled graphs, whereas several kernels allow node labels from a finite alphabet [2, 8]. While most kernels have a runtime that is at least O(n^3), the Weisfeiler-Lehman kernel [2] uses efficient sorting, hashing and counting algorithms that take advantage of repeated occurrences of node labels from the finite label alphabet, and achieves a runtime which is at most quadratic in the number of nodes. Unfortunately, this does not generalize to graphs with vector-valued node attributes, which are typically all distinct samples from an infinite alphabet.\n\nThe first kernel to take advantage of non-discrete node labels was the random walk kernel [9–11]. It incorporates edge probabilities and geometric node attributes [12], but suffers from tottering [13] and is empirically slow. Kriege et al. [14] adopt the idea of comparing matched subgraphs, including vector-valued attributes on nodes and edges. 
However, this kernel has a high computational and memory cost, as we will see in Section 3. Other kernels handling non-discrete attributes use edit-distance and subtree enumeration [15]. While none of these kernels scale well to large graphs, the propagation kernel [16] is fast asymptotically and empirically. It translates the problem of continuous-valued attributes to a problem of discrete-valued labels by hashing node attributes. Nevertheless, its performance depends strongly on the hashing function, and in our experiments it is outperformed in classification accuracy by kernels which do not discretize the attributes.\n\nIn problems where continuous-valued node attributes and inter-node distance d_G(v, w) along the graph G are important features, the shortest path kernel [17], defined as\n\nk_SP(G, G′) = Σ_{v,w∈V} Σ_{v′,w′∈V′} k_n(v, v′) · k_l(d_G(v, w), d_G′(v′, w′)) · k_n(w, w′),\n\nperforms well in classification. In particular, k_SP allows the user to choose any kernels k_n and k_l on nodes and shortest path length. However, the asymptotic runtime of k_SP is generally O(n^4), which makes it infeasible for many real-world applications.\n\n1.2 Our contribution\n\nIn this paper we present a kernel which also compares shortest paths between node pairs from the two graphs, but with a different path kernel. Instead of comparing paths via products of kernels on their lengths and endpoints, we compare paths through kernels on the nodes encountered while "hopping" along shortest paths. This particular path kernel allows us to decompose the graph kernel as a weighted sum of node kernels, initially suggesting a potential runtime as low as O(n^2 d). The graph structure is encoded in the node kernel weights, and the main algorithmic challenge becomes to efficiently compute these weights. 
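Once the weights are available, the weighted-sum decomposition itself is a one-liner. The sketch below is our illustration (function name and matrix layout are assumptions, not the authors' Matlab code): it takes a precomputed weight matrix and a precomputed node-kernel matrix and returns the kernel value.

```python
import numpy as np

# Hedged sketch: given the co-occurrence weights w(v, v') and the node
# kernels k_n(v, v') as matrices of shape (|V|, |V'|), the graph kernel
# is a single weighted sum over all node pairs.
def kernel_from_weights(W, K_node):
    """k(G, G') = sum over v, v' of w(v, v') * k_n(v, v')."""
    W, K_node = np.asarray(W, float), np.asarray(K_node, float)
    assert W.shape == K_node.shape
    return float(np.sum(W * K_node))
```

Computing W efficiently is exactly the combinatorial problem the paper addresses next.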
This is a combinatorial problem, which we solve with complexity O(n^2(m + log n + δ^2)). Note, moreover, that the GraphHopper kernel is parameter-free except for the choice of node kernels.\n\nThe paper is organized as follows. In Section 2 we give short formal definitions and proceed to defining our kernel and investigating its computational properties. Section 3 presents experimental classification results on different datasets in comparison to state-of-the-art kernels as well as empirical runtime studies, before we conclude with a discussion of our findings in Section 4.\n\n2 Graphs, paths and GraphHoppers\n\nWe shall compare undirected graphs G = (V, E) with edge lengths l : E → R+ and node attributes A : V → X from a set X, which can be any set with a kernel k_n; in our data X = R^d. Denote n = |V| and m = |E|. A subtree T ⊂ G is a subgraph of G which is a tree. Such subtrees inherit node attributes and edge lengths from G by restricting the attribute and length maps A and l to the new node and edge sets, respectively. For a tree T = (V, E, r) with a root node r, let p(v) and c(v) denote the parent and the children of any v ∈ V.\n\nGiven nodes v_a, v_b ∈ V, a path π from v_a to v_b in G is defined as a sequence of nodes\n\nπ = [v_1, v_2, v_3, ..., v_n],\n\nwhere v_1 = v_a, v_n = v_b and [v_i, v_{i+1}] ∈ E for all i = 1, ..., n − 1. Let π(i) = v_i denote the ith node encountered when "hopping" along the path. Given paths π and π′ from v to w and from w to u, respectively, let [π, π′] denote their composition, which is a path from v to u. Denote by l(π) the weighted length of π, given by the sum of lengths l(v_i, v_{i+1}) of edges traversed along the path, and denote by |π| the discrete length of π, defined as the number of nodes in π. 
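As a toy illustration of the two path lengths just defined (our own representation, not from the paper: a path is a list of nodes, and edge lengths are keyed by unordered node pair):

```python
# Weighted length l(pi): sum of edge lengths traversed along the path.
# `lengths` maps an undirected edge, stored as a frozenset of its two
# endpoints, to its length l(e) -- this representation is ours.
def weighted_length(path, lengths):
    return sum(lengths[frozenset(e)] for e in zip(path, path[1:]))

# Discrete length |pi|: the number of nodes in the path.
def discrete_length(path):
    return len(path)
```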
The shortest path π_ab from v_a to v_b is defined in terms of weighted length; if no edge length function is given, set l(e) = 1 for all e ∈ E as default. The diameter δ(G) of G is the maximal number of nodes in a shortest path in G, with respect to weighted path length.\n\nIn the next few lemmas we shall prove that for a fixed source node v ∈ V, the directed edges along shortest paths from v to other nodes of G form a well-defined directed acyclic graph (DAG), that is, a directed graph with no cycles. First of all, subpaths of shortest paths π_vw with source node v are shortest paths as well:\n\nLemma 1 [18, Lemma 24.1] If π_1n = [v_1, ..., v_n] is a shortest path from v_1 = v to v_n, then the path π_1n(1 : i) consisting of the first i nodes of π_1n is a shortest path from v_1 = v to v_i. □\n\nGiven a source node v ∈ G, construct the directed graph G_v = (V_v, E_v) consisting of all nodes V_v from the connected component of v in G and the set E_v of all directed edges found in any shortest path from v to any given node w in G_v. Any directed walk from v in G_v is a shortest path in G:\n\nLemma 2 If π_1n is a shortest path from v_1 = v to v_n and (v_n, v_{n+1}) ∈ E_v, then [π_1n, [v_n, v_{n+1}]] is a shortest path from v_1 = v to v_{n+1}.\n\nProof. Since (v_n, v_{n+1}) ∈ E_v, there is a shortest path π_1(n+1) = [v_1, ..., v_n, v_{n+1}] from v_1 = v to v_{n+1}. If this path is shorter than [π_1n, [v_n, v_{n+1}]], then π_1(n+1)(1 : n) is a shortest path from v_1 = v to v_n by Lemma 1, and it must be shorter than π_1n. This is impossible, since π_1n is a shortest path. □\n\nProposition 3 The shortest path graph G_v is a DAG.\n\nProof. Assume, on the contrary, that G_v contains a cycle c = [v_1, ..., v_n] where (v_i, v_{i+1}) ∈ E_v for each i = 1, ..., n − 1 and v_1 = v_n. Let π_v1 be the shortest path from v to v_1. 
Using Lemma 2 repeatedly, we see that the path [π_v1, c] is a shortest path from v to v_n = v_1, which is impossible since the new path must be longer than the shortest path π_v1. □\n\n2.1 The GraphHopper kernel\n\nWe define the GraphHopper kernel as a sum of path kernels k_p over the families P, P′ of shortest paths in G, G′:\n\nk(G, G′) = Σ_{π∈P, π′∈P′} k_p(π, π′).\n\nIn this paper, the path kernel k_p(π, π′) is a sum of node kernels k_n on nodes simultaneously encountered while simultaneously hopping along paths π and π′ of equal discrete length, that is:\n\nk_p(π, π′) = Σ_{j=1}^{|π|} k_n(π(j), π′(j)) if |π| = |π′|, and 0 otherwise. (4)\n\nIt is clear from the definition that k(G, G′) decomposes as a sum of node kernels:\n\nk(G, G′) = Σ_{v∈V} Σ_{v′∈V′} w(v, v′) k_n(v, v′), (5)\n\nwhere w(v, v′) counts the number of times v and v′ appear at the same hop, or coordinate, i of shortest paths π, π′ of equal discrete length |π| = |π′|. We can decompose the weight w(v, v′) as\n\nw(v, v′) = Σ_{j=1}^{δ} Σ_{i=1}^{δ} #{(π, π′) | π(i) = v, π′(i) = v′, |π| = |π′| = j} = ⟨M(v), M(v′)⟩,\n\n[Figure 1: Top: Expansion from the graph G, to the DAG G_ṽ, to a larger tree S_ṽ. Bottom left: Recursive computation of the o^v_ṽ. Bottom middle and right: Recursive computation of the d^v_r in a rooted tree as in Algorithm 2, and of the d^v_ṽ on a DAG G_ṽ as in Algorithm 3.]\n\nwhere M(v) is a δ × δ matrix whose entry [M(v)]_ij counts how many times v appears at the ith coordinate of a shortest path in G of discrete length j, and δ = max{δ(G), δ(G′)}. More precisely,\n\n[M(v)]_ij = number of times v appears as the ith node on a shortest path of discrete length j\n= Σ_{ṽ∈V} (number of times v appears as ith node on a shortest path from ṽ of discrete length j)\n= Σ_{ṽ∈V} D_ṽ(v, j − i + 1) O_ṽ(v, i). (6)\n\nHere D_ṽ is an n × δ matrix whose (v, i)-coordinate counts the number of directed walks with i nodes starting at v in the shortest path DAG G_ṽ. The O_ṽ is an n × δ matrix whose (v, i)-coordinate counts the number of directed walks from ṽ to v in G_ṽ with i nodes. Given the matrices D_ṽ and O_ṽ, we compute all M(v) by looping through all choices of source node ṽ ∈ V, adding up the contributions M_ṽ(v) to M(v) from each ṽ, as detailed in Algorithm 4.\n\nThe vth row of O_ṽ, denoted o^v_ṽ, is computed recursively by message-passing from the root, as detailed in Figure 1 and Algorithm 1. Here, V^j_ṽ consists of the nodes v ∈ V for which the shortest paths π_ṽv of highest discrete length have j nodes. Algorithm 1 sends one message of size at most δ per edge, and thus has complexity O(mδ).\n\nTo compute the vth row of D_ṽ, denoted d^v_ṽ, we draw inspiration from [19], where the vectors d^v_r are computed easily for trees using a message-passing algorithm as follows. Let T = (V, E, r) be a tree with a designated root node r. The ith coefficient of d^v_r counts the number of paths from v in T of discrete length i, directed from the root. This is just the number of descendants of v at level i below v in T. 
Let ⊕ denote left-aligned addition of vectors of possibly different length, e.g.\n\n[a, b, c] ⊕ [d, e] = [(a + d), (b + e), c]. (7)\n\nUsing ⊕, the d^v_r can be expressed recursively:\n\nd^v_r = [1] ⊕ ⊕_{p(w)=v} [0, d^w_r].\n\nAlgorithm 1 Message-passing algorithm for computing o^v_ṽ for all v, on G_ṽ\n1: Initialize: o^ṽ_ṽ = [1]; o^v_ṽ = [0] ∀ v ∈ V \ {ṽ}.\n2: for j = 1 ... δ do\n3:   for v ∈ V^j_ṽ do\n4:     for (v, w) ∈ E_ṽ do\n5:       o^w_ṽ = o^w_ṽ ⊕ [0, o^v_ṽ]\n6:     end for\n7:   end for\n8: end for\n\n[Figure 1 illustrates these computations with worked numeric examples; the annotations are omitted here.]\n\nAlgorithm 2 Recursive computation of d^v_r for all v on T = (V, E, r)\n1: Initialize: d^v_r = [1] ∀ v ∈ V.\n2: for e = (v, c(v)) ∈ E do\n3:   d^v_r = d^v_r ⊕ [0, d^{c(v)}_r]\n4: end for\n\nAlgorithm 3 Recursive computation of d^v_ṽ for all v on G_ṽ\n1: Initialize: d^v_ṽ = [1] ∀ v ∈ V.\n2: for e = (v, c(v)) ∈ EG do\n3:   d^v_ṽ = d^v_ṽ ⊕ [0, d^{c(v)}_ṽ]\n4: end for\n\nThe d^v_r for all v ∈ V are computed recursively, sending counters along the edges from the leaf nodes towards the root, recording the number of descendants of any node at any level; see Algorithm 2 and Figure 1. The d^v_r for all v ∈ V are computed in O(nh) time, where h is tree height, since each edge passes exactly one message of size ≤ h.\n\nOn a DAG, computing d^v_ṽ is a little more complex. 
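A minimal Python sketch of ⊕ and of the tree version of the descendant-vector computation (in the spirit of Algorithm 2): a plain recursive traversal replaces the paper's leaf-to-root message passing, and all names and the `children` dictionary representation are ours.

```python
# Left-aligned vector addition as in eq. (7): [a,b,c] + [d,e] = [a+d, b+e, c].
def oplus(a, b):
    long, short = (a, b) if len(a) >= len(b) else (b, a)
    return [x + (short[i] if i < len(short) else 0) for i, x in enumerate(long)]

# Descendant-count vectors on a rooted tree: d[v][i] is the number of
# descendants of v exactly i levels below v (d[v][0] = 1 for v itself).
# `children` maps each node to the list of its children.
def descendant_vectors(children, root):
    d = {}
    def visit(v):
        d[v] = [1]
        for c in children.get(v, []):
            visit(c)
            d[v] = oplus(d[v], [0] + d[c])
    visit(root)
    return d
```

The DAG version (Algorithm 3) follows the same update rule but sends each vector along the DAG edges towards the root.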
Note that the DAG G_ṽ generated by all shortest paths from ṽ ∈ V can be expanded into a rooted tree S_ṽ by duplicating any node with several incoming edges, see Figure 1. The tree S_ṽ contains, as a path from the root ṽ to one of the nodes labeled v in S_ṽ, any shortest path from ṽ to v in G. However, the number of nodes in S_ṽ could, in theory, be exponential in n, making computation of d^v_ṽ by message-passing on S_ṽ intractable. Thus, we shall compute the d^v_ṽ on the DAG G_ṽ rather than on S_ṽ. As on trees, the d^v_ṽ in S_ṽ are given by d^v_ṽ = [1] ⊕ ⊕_{(v,w)∈E_ṽ} [0, d^w_ṽ], where ⊕ is defined in (7). This observation leads to an algorithm in which each edge e ∈ E_ṽ passes exactly one vector of size ≤ δ + 1 in the direction of the root ṽ, starting at the leaves of the DAG G_ṽ and computing updated descendant vectors for each receiving node. See Algorithm 3 and Figure 1. The complexity of Algorithm 3, which computes d^v_ṽ for all v ∈ V, is O(|E_ṽ|δ) ≤ O(mδ).\n\n2.2 Computational complexity analysis\n\nGiven the w(v, v′) and the k_n(v, v′) for all v ∈ V and v′ ∈ V′, the kernel can be computed in O(n^2) time. If we assume that each node kernel k_n(v, v′) can be computed in O(d) time (as is the case with many standard kernels, including Gaussian and linear kernels), then all k_n(v, v′) can be precomputed in O(n^2 d) time. Given the matrices M(v) and M(v′) for all v ∈ V, v′ ∈ V′, each w(v, v′) requires O(δ^2) time, giving O(n^2 δ^2) complexity for computing all weights w(v, v′).\n\nNote that Algorithm 4 computes M(v) for all v ∈ V simultaneously. 
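The core of combining the occurrence counts O_ṽ(v, ·) and descendant counts D_ṽ(v, ·) into the per-source contribution M_ṽ(v), with [M_ṽ(v)]_ij = D_ṽ(v, j − i + 1) · O_ṽ(v, i) for i ≤ j, can be sketched as follows. This is our illustration only: 0-based array indices replace the paper's 1-based matrix coordinates, and the function name is ours.

```python
import numpy as np

def source_contribution(D_row, O_row, delta):
    """Contribution of one source node to M(v).

    D_row[k]: number of directed walks with k+1 nodes starting at v in the
    shortest path DAG of the source; O_row[k]: number of walks from the
    source to v with k+1 nodes. Entry M[i, j] pairs hop position i with
    path length j (both 0-based), and is zero for i > j."""
    M = np.zeros((delta, delta))
    for i in range(delta):
        for j in range(i, delta):
            M[i, j] = D_row[j - i] * O_row[i]
    return M
```

Summing these contributions over all source nodes yields M(v), matching the update in Algorithm 4.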
Adding the time complexities of the lines in each iteration of the algorithm, as given in parentheses at the individual lines of Algorithm 4, the total complexity of one iteration of Algorithm 4 is\n\nO((mn + n log n) + mδ + mδ + nδ^2 + nδ^2) = O(n(m + log n + δ^2)),\n\ngiving total complexity O(n^2(m + log n + δ^2)) for computing M(v) for all v ∈ V using Algorithm 4.\n\nAlgorithm 4 Algorithm simultaneously computing all M(v)\n1: Initialize: M(v) = 0 ∈ R^{δ×δ} for each v ∈ V.\n2: for all ṽ ∈ V do\n3:   compute shortest path DAG G_ṽ rooted at ṽ using Dijkstra (O(mn + n log n))\n4:   compute D_ṽ(v) for each v ∈ V (O(mδ))\n5:   compute O_ṽ(v) for each v ∈ V (O(mδ))\n6:   for each v ∈ V, compute the δ × δ matrix M_ṽ(v) given by [M_ṽ(v)]_ij = D_ṽ(v, j − i + 1) O_ṽ(v, i) when i ≤ j, and 0 otherwise (O(nδ^2))\n7:   update M(v) = M(v) + M_ṽ(v) for each v ∈ V (O(nδ^2))\n8: end for\n\nIt follows that the total complexity of computing k(G, G′) is\n\nO(n^2 + n^2 d + n^2 δ^2 + n^2 δ^2 + n^2(m + log n + δ^2)) = O(n^2(m + log n + d + δ^2)).\n\nWhen computing the kernel matrix K_ij = k(G_i, G_j) for a set {G_i}_{i=1}^N of graphs with N > m + n + δ^2, note that Algorithm 4 only needs to be run once for every graph G_i. Thus, the average complexity of computing one kernel value out of all K_ij becomes\n\n(1/N^2) (N · O(n^2(m + log n + δ^2)) + N^2 · O(n^2 + n^2 d + δ^2)) ≤ O(n^2 d).\n\n3 Experiments\n\nClassification experiments were made with the proposed GraphHopper kernel and several alternatives: the propagation kernel PROP [16], the connected subgraph matching kernel CSM [14] and the shortest path kernel SP [17] all use continuous-valued attributes. 
In addition, we benchmark against the Weisfeiler-Lehman kernel WL [2], which only uses discrete node attributes. All kernels were implemented in Matlab, except for CSM, where a Java implementation was supplied by N. Kriege. For the WL kernel, the Matlab implementation available from [20] was used. For the GraphHopper and SP kernels, shortest paths were computed using the BGL package [21] implemented in C++. The PROP kernel was implemented in two different versions, both using the total variation hash function, as the Hellinger distance is only directly applicable to positive vector-valued attributes. For PROP-diff, labels were propagated with the diffusion scheme, whereas in PROP-WL labels were first discretised via hashing and then the WL kernel [2] update was used. The bin width of the hash function was set to 10^-5 as suggested in [16]. The PROP-diff, PROP-WL and WL kernels were each run with 10 iterations. In the CSM kernel, the clique size parameter was set to k = 5. Our kernel implementations and datasets (with the exception of AIRWAYS) can be found at http://image.diku.dk/aasa/software.php.\n\nClassification experiments were made on four datasets: ENZYMES, PROTEINS, AIRWAYS and SYNTHETIC. ENZYMES and PROTEINS are sets of proteins from the BRENDA database [22] and the dataset of Dobson and Doig [23], respectively. Proteins are represented by graphs as follows. Nodes represent secondary structure elements (SSEs), which are connected whenever they are neighbors either in the amino acid sequence or in 3D space [24]. Each node has a discrete type attribute (helix, sheet or turn) and an attribute vector containing physical and chemical measurements, including length of the SSE in Ångström (Å), distance between the Cα atom of its first and last residue in Å, its hydrophobicity, van der Waals volume, polarity and polarizability. 
ENZYMES comes with the task of classifying the enzymes into one of 6 EC top-level classes, whereas PROTEINS comes with the task of classifying into enzymes and non-enzymes. AIRWAYS is a set of airway trees extracted from CT scans of human lungs [25, 26]. Each node represents an airway branch, attributed with its length. Edges represent adjacencies between airway bronchi. AIRWAYS comes with the task of classifying airways into healthy individuals and patients suffering from Chronic Obstructive Pulmonary Disease (COPD). SYNTHETIC is a set of synthetic graphs based on a random graph G with 100 nodes and 196 edges, whose nodes are endowed with normally distributed scalar attributes sampled from N(0, 1). Two classes A and B, each with 150 attributed graphs, were generated from G by randomly rewiring edges and permuting node attributes. Each graph in A was generated by rewiring 5 edges and permuting 10 node attributes, and each graph in B was generated by rewiring 10 edges and permuting 5 node attributes, after which noise from N(0, 0.45^2) was added to every node attribute in every graph. Detailed metrics of the datasets are found in Table 1.\n\nGraphHopper, SP and CSM all depend on freely selected node kernels for continuous attributes, giving modeling flexibility. For the ENZYMES, AIRWAYS and SYNTHETIC datasets, a Gaussian node kernel k_n(v, v′) = e^{−λ‖A(v)−A(v′)‖^2} was used on the continuous-valued attribute, with λ = 1/d. For the PROTEINS dataset, the node kernel was a product of a Gaussian kernel with λ = 1/d and a Dirac kernel on the continuous- and discrete-valued node attributes, respectively. For the WL kernel, discrete node labels were used when available (in ENZYMES and PROTEINS); otherwise node degree was used as node label.\n\nClassification was done using a support vector machine (SVM) [27]. 
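The Gaussian node kernel used in these experiments is straightforward to implement. A minimal sketch, with λ defaulting to 1/d as in the paper's setup (the function name is ours):

```python
import numpy as np

def gaussian_node_kernel(a, b, lam=None):
    """k_n(v, v') = exp(-lam * ||A(v) - A(v')||^2).

    `a` and `b` are the attribute vectors A(v), A(v'); if `lam` is not
    given, it defaults to 1/d, where d is the attribute dimension."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    if lam is None:
        lam = 1.0 / a.size
    return float(np.exp(-lam * np.sum((a - b) ** 2)))
```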
The SVM slack parameter was trained using nested cross validation on 90% of the entire dataset, and the classifier was tested on the remaining 10%. This experiment was repeated 10 times.\n\nTable 1: Data statistics: average node and edge counts, average graph diameter, node attribute dimension, and dataset and class sizes.\n\n                          ENZYMES   PROTEINS   AIRWAYS   SYNTHETIC\nNumber of nodes           32.6      39.1       221       100\nNumber of edges           46.7      72.8       220       196\nGraph diameter            12.8      11.6       21.1      7\nNode attribute dimension  18        1          1         1\nDataset size              600       1113       1966      300\nClass size                6 × 100   663/450    980/986   150/150\n\nTable 2: Mean classification accuracies with standard deviation for all experiments, significantly best accuracies in bold. OUT OF MEMORY means that 100 GB of memory was not enough. OUT OF TIME indicates that the kernel computation did not finish within 30 days. Runtimes are given in parentheses; see Section 3.1 for further runtime studies. Below, x′y″ means x minutes, y seconds.\n\nKernel          ENZYMES              PROTEINS             AIRWAYS               SYNTHETIC\nGraphHopper     69.6 ± 1.3 (12′10″)  74.1 ± 0.5 (2.8 h)   66.8 ± 0.5 (1 d 7 h)  86.6 ± 1.0 (12′10″)\nPROP-diff [16]  37.2 ± 2.2 (13″)     73.3 ± 0.4 (26″)     63.5 ± 0.5 (4′12″)    46.1 ± 1.9 (1′21″)\nPROP-WL [16]    48.5 ± 1.3 (1′9″)    73.1 ± 0.8 (2′40″)   61.5 ± 0.6 (8′17″)    44.5 ± 1.2 (1′52″)\nSP [17]         71.0 ± 1.3 (3 d)     75.5 ± 0.8 (7.7 d)   OUT OF TIME           85.4 ± 2.1 (3.4 d)\nCSM [14]        48.0 ± 0.9 (18″)     OUT OF MEMORY        OUT OF MEMORY         OUT OF TIME\nWL [2]          69.4 ± 0.8           75.6 ± 0.5 (2′51″)   62.0 ± 0.6 (7′43″)    43.3 ± 2.3 (2′8″)\n\n
Mean accuracies with standard deviations are reported in Table 2. For each kernel and dataset, runtime is given in parentheses in Table 2. Runtimes for the CSM kernel are not included, as this implementation was in another language.\n\n3.1 Runtime experiments\n\nAn empirical evaluation of the runtime dependence on the parameters n, m and δ is found in Figure 2. In the top left panel, average kernel evaluation runtime was measured on datasets of 10 random graphs with 10, 20, 30, ..., 500 nodes each, and a density of 0.4. Density is defined as m/(n(n−1)/2), i.e. the fraction of edges in the graph compared to the number of edges in the complete graph. In the top right panel, the number of nodes was kept constant at n = 100, while datasets of 10 random graphs were generated with 110, 120, ..., 500 edges each. The development of both average kernel evaluation runtime and graph diameter is shown. In the bottom panels, the relationship between runtime and graph diameter is shown on subsets of 100 and 200 graphs of the real AIRWAYS and PROTEINS datasets, respectively, for each diameter.\n\n3.2 Results and discussion\n\nOur experiments on ENZYMES and AIRWAYS clearly demonstrate that there are real-world classification problems where continuous-valued attributes make a big contribution to classification performance. Our experiments on SYNTHETIC demonstrate how the more discrete types of kernels, PROP and WL, are unable to classify the graphs. Already on SYNTHETIC, which is a modest-sized set of modest-sized graphs, CSM and SP are too computationally demanding to be practical, and on AIRWAYS, which is a larger set of larger trees, they cannot finish in 30 days. The CSM kernel [14] has asymptotic runtime O(k n^{k+1}), where k is a parameter bounding the size of subgraphs considered by the kernel, and thus, in order to study subgraphs of relevant size, its runtime will be at least as high as that of the shortest path kernel. 
Moreover, the CSM kernel requires the computation of a product graph which, for graphs with hundreds of nodes, can cause memory problems, as we also find in our experiments. The PROP kernel is fast; however, the reason for the computational efficiency of PROP is that it is not really a kernel for continuous-valued features – it is a kernel for discrete features combined with a hashing scheme to discretize continuous-valued features. In our experiments, these hashing schemes do not prove powerful enough to compete in classification accuracy with the kernels that really do use the continuous-valued features.\n\nWhile ENZYMES and AIRWAYS benefit significantly from including continuous attributes, our experiments on PROTEINS demonstrate that there are also classification problems where the most important information is just as well summarized in a discrete feature: here our combination of continuous and discrete node features gives classification performance equal to that of the more efficient WL kernel using only discrete attributes.\n\n[Figure 2: Dependence of runtime on n, δ and m on synthetic and real graph datasets.]\n\nWe proved in Section 2.2 that the GraphHopper kernel has asymptotic runtime O(n^2(d + m + log n + δ^2)), and that the average runtime for one kernel evaluation in a Gram matrix is O(n^2 d) when the number of graphs exceeds m + n + δ^2. Our experiments in Section 3.1 empirically demonstrate how runtime depends on the parameters n, m and δ. As m and δ are dependent parameters, the runtime dependence on m and δ is not straightforward. An increase in the number of edges m typically leads to an increased graph diameter δ for small m, but for more densely connected graphs, δ will decrease with increasing m, as seen in the top right panel of Figure 2. A consequence of this is that the graph diameter rarely becomes very large compared to m. 
The same plot also shows that the runtime increases slowly with increasing m. Our runtime experiments clearly illustrate that while in the worst-case scenario we could have m = n^2 or δ = n, this rarely happens in real-world graphs, which are often sparse and of small diameter. Our experiments also illustrate an average runtime quadratic in n on large datasets, as expected based on the complexity analysis.\n\n4 Conclusion\n\nWe have defined the GraphHopper kernel for graphs with any type of node attributes, presented an efficient algorithm for computing it, and demonstrated that it outperforms state-of-the-art graph kernels on real and synthetic data in terms of classification accuracy and/or speed. The kernel is able to take advantage of any kind of node attributes, as it can integrate any user-defined node kernel. Moreover, the kernel is parameter-free except for the node kernels. This kernel opens the door to new application domains such as computer vision or medical imaging, in which kernels that work solely on graphs with discrete attributes have so far been too restrictive.\n\nAcknowledgements\n\nThe authors wish to thank Nils Kriege for sharing his code for computing the CSM kernel, Nino Shervashidze and Chloé-Agathe Azencott for sharing their preprocessed chemoinformatics data, and Asger Dirksen and Jesper Pedersen for sharing the AIRWAYS dataset. This work is supported by the Danish Research Council for Independent Research | Technology and Production, the Knud Højgaard Foundation, AstraZeneca, The Danish Council for Strategic Research, the Netherlands Organisation for Scientific Research, and the DFG project "Kernels for Large, Labeled Graphs (LaLa)". The research of Professor Dr. 
Karsten Borgwardt was supported by the Alfried Krupp Prize for Young University Teachers of the Alfried Krupp von Bohlen und Halbach-Stiftung.

References

[1] N. Shervashidze, S.V.N. Vishwanathan, T. Petri, K. Mehlhorn, and K.M. Borgwardt. Efficient graphlet kernels for large graph comparison. JMLR, 5:488–495, 2009.
[2] N. Shervashidze, P. Schweitzer, E.J. van Leeuwen, K. Mehlhorn, and K.M. Borgwardt. Weisfeiler-Lehman graph kernels. JMLR, 12:2539–2561, 2011.
[3] D. Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.
[4] M. Collins and N. Duffy. Convolution kernels for natural language. In NIPS, pages 625–632, 2001.
[5] S.V.N. Vishwanathan and A.J. Smola. Fast kernels for string and tree matching. In NIPS, pages 569–576, 2002.
[6] C. Cortes, P. Haffner, and M. Mohri. Rational kernels: Theory and algorithms. JMLR, 5:1035–1062, 2004.
[7] D. Kimura and H. Kashima. Fast computation of subpath kernel for trees. In ICML, 2012.
[8] P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Machine Learning, 75:3–35, 2009.
[9] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In ICML, pages 321–328, 2003.
[10] T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, volume 2777 of LNCS, pages 129–143, 2003.
[11] S.V.N. Vishwanathan, N.N. Schraudolph, R.I. Kondor, and K.M. Borgwardt. Graph kernels. JMLR, 11:1201–1242, 2010.
[12] F.R. Bach. Graph kernels between point clouds. In ICML, pages 25–32, 2008.
[13] P. Mahé, N.
Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert. Extensions of marginalized graph kernels. In ICML, 2004.
[14] N. Kriege and P. Mutzel. Subgraph matching kernels for attributed graphs. In ICML, 2012.
[15] B. Gaüzère, L. Brun, and D. Villemin. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters, 15:2038–2047, 2012.
[16] M. Neumann, N. Patricia, R. Garnett, and K. Kersting. Efficient graph kernels by randomization. In ECML/PKDD (1), pages 378–393, 2012.
[17] K.M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM, 2005.
[18] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms (3rd ed.). MIT Press, 2009.
[19] A. Feragen, J. Petersen, D. Grimm, A. Dirksen, J.H. Pedersen, K. Borgwardt, and M. de Bruijne. Geometric tree kernels: Classification of COPD from airway tree geometry. In IPMI, 2013.
[20] N. Shervashidze. Graph kernels code, http://mlcb.is.tuebingen.mpg.de/Mitarbeiter/Nino/Graphkernels/.
[21] D. Gleich. MatlabBGL, http://dgleich.github.io/matlab-bgl/.
[22] I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research, 32:431–433, 2004.
[23] P.D. Dobson and A.J. Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.
[24] K.M. Borgwardt, C.S. Ong, S. Schönauer, S.V.N. Vishwanathan, A.J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl 1):i47–i56, 2005.
[25] J. Pedersen, H. Ashraf, A. Dirksen, K. Bach, H. Hansen, P. Toennesen, H. Thorsen, J. Brodersen, B. Skov, M. Døssing, J. Mortensen, K. Richter, P. Clementsen, and N. Seersholm. The Danish randomized lung cancer CT screening trial - overall design and results of the prevalence round.
J Thorac Oncol, 4(5):608–614, May 2009.
[26] J. Petersen, M. Nielsen, P. Lo, Z. Saghir, A. Dirksen, and M. de Bruijne. Optimal graph based segmentation using flow lines with application to airway wall segmentation. In IPMI, LNCS, pages 49–60, 2011.
[27] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.