{"title": "The Multiscale Laplacian Graph Kernel", "book": "Advances in Neural Information Processing Systems", "page_first": 2990, "page_last": 2998, "abstract": "Many real world graphs, such as the graphs of molecules, exhibit structure at multiple different scales, but most existing kernels between graphs are either purely local or purely global in character. In contrast, by building a hierarchy of nested subgraphs, the Multiscale Laplacian Graph kernels (MLG kernels) that we define in this paper can account for structure at a range of different scales. At the heart of the MLG construction is another new graph kernel, called the Feature Space Laplacian Graph kernel (FLG kernel), which has the property that it can lift a base kernel defined on the vertices of two graphs to a kernel between the graphs. The MLG kernel applies such FLG kernels to subgraphs recursively. To make the MLG kernel computationally feasible, we also introduce a randomized projection procedure, similar to the Nystro \u0308m method, but for RKHS operators.", "full_text": "The Multiscale Laplacian Graph Kernel\n\nRisi Kondor\n\nHorace Pan\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nDepartment of Statistics\nUniversity of Chicago\n\nChicago, IL 60637\n\nrisi@cs.uchicago.edu\n\nUniversity of Chicago\n\nChicago, IL 60637\n\nhopan@uchicago.edu\n\nAbstract\n\nMany real world graphs, such as the graphs of molecules, exhibit structure at mul-\ntiple different scales, but most existing kernels between graphs are either purely\nlocal or purely global in character. In contrast, by building a hierarchy of nested\nsubgraphs, the Multiscale Laplacian Graph kernels (MLG kernels) that we de\ufb01ne\nin this paper can account for structure at a range of different scales. 
At the heart of the MLG construction is another new graph kernel, called the Feature Space Laplacian Graph kernel (FLG kernel), which has the property that it can lift a base kernel defined on the vertices of two graphs to a kernel between the graphs. The MLG kernel applies such FLG kernels to subgraphs recursively. To make the MLG kernel computationally feasible, we also introduce a randomized projection procedure, similar to the Nyström method, but for RKHS operators.

1 Introduction

There is a wide range of problems in applied machine learning, from web data mining [1] to protein function prediction [2], where the input space is a space of graphs. A particularly important application domain is chemoinformatics, where the graphs capture the structure of molecules. In the pharmaceutical industry, for example, machine learning algorithms are regularly used to screen candidate drug compounds for safety and efficacy against specific diseases [3].
Because kernel methods neatly separate the issue of data representation from the statistical learning component, it is natural to formulate graph learning problems in the kernel paradigm. Starting with [4], a number of different graph kernels have appeared in the literature (for an overview, see [5]). In general, a graph kernel k(G1, G2) must satisfy the following requirements:
1. The kernel should capture the right notion of similarity between G1 and G2. For example, if G1 and G2 are social networks, then k might capture to what extent their clustering structure, degree distribution, etc. match. If, on the other hand, G1 and G2 are molecules, then we are probably more interested in what functional groups are present, and how they are arranged relative to each other.
2. The kernel is usually computed from the adjacency matrices A1 and A2 of the two graphs, but it must be invariant to the ordering of the vertices.
In other words, writing the kernel explicitly in terms of A1 and A2, we must have k(A1, A2) = k(A1, P A2 P⊤) for any permutation matrix P.
Permutation invariance has proved to be the central constraint around which much of the graph kernels literature is organized, effectively stipulating that graph kernels must be built out of graph invariants. Efficiently computable graph invariants offered by the mathematics literature tend to fall in one of two categories:
1. Local invariants, which can often be reduced to simply counting some local properties, such as the number of triangles, squares, etc. that appear in G as subgraphs.
2. Spectral invariants, which can be expressed as functions of the eigenvalues of the adjacency matrix or the graph Laplacian.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Correspondingly, while different graph kernels are motivated in very different ways, from random walks [4] through shortest paths [6, 7] to Fourier transforms on the symmetric group [8], most graph kernels in the literature ultimately reduce to computing a function of the two graphs that is either purely local or purely spectral. Any of the kernels based on the "subgraph counting" idea (e.g., [9]) are local. On the other hand, most of the random walk based kernels are reducible to a spectral form involving the eigenvalues of either the two graphs individually, or their Kronecker product [5], and therefore are really only sensitive to the large scale structure of graphs.
In practice, it would be desirable to have a kernel that can take structure into account at multiple different scales.
A kernel between molecules, for example, should not only be sensitive to the overall large-scale shape of the graphs (whether they are more like a chain, a ring, a chain that branches, etc.), but also to what smaller structures (e.g., functional groups) are present in the graphs, and how they are related to the global structure (e.g., whether a particular functional group is towards the middle or one of the ends of a chain).
For the most part, such a multiscale graph kernel has been missing from the literature. Two notable exceptions are the Weisfeiler–Lehman kernel [10] and the propagation kernel [11]. The WL kernel uses a combination of message passing and hashing to build summaries of the local neighborhoods of vertices at different scales. While shown to be effective, the Weisfeiler–Lehman kernel's hashing step is somewhat ad hoc; perturbing the edges by a small amount leads to completely different hash features. Similarly, the propagation kernel monitors how the distribution of node/edge labels spreads through the graph and then uses locality sensitive hashing to efficiently bin the label distributions into feature vectors.
Most recently, structure2vec [12] attempts to represent each graph with a latent variable model and then embeds them into a feature space, using the inner product as a kernel. This approach compares favorably to the standard kernel methods in both accuracy and computational efficiency.
In this paper we present a new graph kernel, the Multiscale Laplacian Graph Kernel (MLG kernel), which, we believe, is the first kernel in the literature that can truly compare structure in graphs simultaneously at multiple different scales. We begin by introducing the Feature Space Laplacian Graph Kernel (FLG kernel) in Section 2.
The FLG kernel operates at a single scale, combining information from the nodes' vertex features with topological information through the graph Laplacian. An important property of the FLG kernel is that it can work with vertex labels provided as a "base kernel" on the vertices, which allows us to apply the FLG kernel recursively.
The MLG kernel, defined in Section 3, uses the FLG kernel's recursive property to build a hierarchy of subgraph kernels that are sensitive to both the topological relationships between individual vertices, and between subgraphs of increasing sizes. Each kernel is defined in terms of the preceding kernel in the hierarchy. Efficient computability is a major concern in our paper, and recursively defined kernels on combinatorial data structures can be very expensive. Therefore, in Section 4 we describe a strategy based on a combination of linearizing each level of the kernel (relative to a given dataset) and a randomized low rank projection step, which reduces every stage of the kernel computation to simple operations involving small matrices, leading to a very fast algorithm. Finally, Section 5 presents experimental comparisons of our kernel with competing methods.

2 Laplacian Graph Kernels

Let G be a weighted undirected graph with vertex set V = {v1, . . . , vn} and edge set E. Recall that the graph Laplacian of G is an n × n matrix L^G, with

    L^G_{i,j} = −w_{i,j}                         if {vi, vj} ∈ E,
    L^G_{i,j} = Σ_{k : {vi,vk} ∈ E} w_{i,k}      if i = j,
    L^G_{i,j} = 0                                otherwise,

where w_{i,j} is the weight of edge {vi, vj}. The graph Laplacian is positive semi-definite, and in terms of the adjacency matrix A and the weighted degree matrix D it can be expressed as L^G = D − A.
Spectral graph theory tells us that the low eigenvalue eigenvectors of L^G are informative about the overall shape of G.
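In code, this construction is immediate; the following is a minimal NumPy sketch of our own (the function name and the default value of the regularizer are our choices, not the paper's):

```python
import numpy as np

def regularized_laplacian(A, eta=0.01):
    """Graph Laplacian L = D - A plus a small regularizer eta*I.

    A is a symmetric (weighted) adjacency matrix.  Adding eta*I removes
    the zero eigenvalue of D - A, so the result is strictly positive
    definite and hence invertible.
    """
    D = np.diag(A.sum(axis=1))            # weighted degree matrix
    return D - A + eta * np.eye(A.shape[0])
```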
One way of seeing this is to note that for any vector z ∈ R^n,

    z⊤ L^G z = Σ_{{i,j} ∈ E} w_{i,j} (z_i − z_j)²,

so the low eigenvalue eigenvectors are the smoothest functions on G, in the sense that they vary the least between adjacent vertices. An alternative interpretation emerges if we use G to construct a Gaussian graphical model (Markov Random Field or MRF) over n variables x1, . . . , xn with clique potentials φ(x_i, x_j) = e^{−w_{i,j}(x_i−x_j)²/2} for each edge and ψ(x_i) = e^{−η x_i²/2} for each vertex. The joint distribution of x = (x1, . . . , xn)⊤ is then

    p(x) ∝ ( Π_{{vi,vj} ∈ E} e^{−w_{i,j}(x_i−x_j)²/2} ) ( Π_{vi ∈ V} e^{−η x_i²/2} ) = e^{−x⊤(L+ηI)x/2},    (1)

showing that the covariance matrix of x is (L^G + ηI)^{−1}. Note that the ψ factors were added to ensure that the distribution is normalizable, and η is typically just a small constant "regularizer": L^G actually has a zero eigenvalue eigenvector (namely the constant vector n^{−1/2}(1, 1, . . . , 1)⊤), so without adding ηI we would not be able to invert it. In the following we will call L^G + ηI the regularized Laplacian, and denote it simply by L.
Both of the above views suggest that if we want to define a kernel between graphs that is sensitive to their overall shape, comparing the low eigenvalue eigenvectors of their Laplacians is a good place to start. Previous work by [13] also used the graph Laplacian for constructing a similarity function on graphs. Following the MRF route, given two graphs G1 and G2 of n vertices, we can define the kernel between them to be a kernel between the corresponding distributions p1 = N(0, L1^{−1}) and p2 = N(0, L2^{−1}).
Specifically, we will use the Bhattacharyya kernel [14]

    k(p1, p2) = ∫ √p1(x) √p2(x) dx,    (2)

because for Gaussian distributions it can be computed in closed form, giving

    k(p1, p2) = |((1/2) L1 + (1/2) L2)^{−1}|^{1/2} / ( |L1^{−1}|^{1/4} |L2^{−1}|^{1/4} ).

If some of the eigenvalues of L1^{−1} or L2^{−1} are zero or very close to zero, along certain directions in space the two distributions in (2) become very flat, leading to vanishingly small kernel values (unless the "flat" directions of the two Gaussians are perfectly aligned). To remedy this problem, similarly to [15], we "soften" (or regularize) the kernel by adding some small constant γ times the identity to L1^{−1} and L2^{−1}. This leads to what we call the Laplacian Graph Kernel.
Definition 1. Let G1 and G2 be two graphs with n vertices with (regularized) Laplacians L1 and L2, respectively. We define the Laplacian graph kernel (LG kernel) with parameter γ between G1 and G2 as

    k_LG(G1, G2) = |((1/2) S1^{−1} + (1/2) S2^{−1})^{−1}|^{1/2} / ( |S1|^{1/4} |S2|^{1/4} ),    (3)

where S1 = L1^{−1} + γI and S2 = L2^{−1} + γI.
By virtue of (2), the LG kernel is positive semi-definite, and because the value of the overlap integral is largely determined by the extent to which the subspaces spanned by the largest eigenvalue eigenvectors of L1^{−1} and L2^{−1} are aligned, it effectively captures similarity between the overall shapes of G1 and G2. However, the LG kernel does suffer from three major limitations: it assumes that both graphs have the same number of vertices, it is only sensitive to the overall structure of the two graphs, and it is not invariant to permuting the vertices.
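For concreteness, here is a direct (unoptimized) NumPy sketch of Definition 1. The function name and the use of `slogdet` for numerical stability are our own choices, not part of the paper:

```python
import numpy as np

def lg_kernel(L1, L2, gamma=0.1):
    """Laplacian graph kernel (eq. 3) between two graphs with the same
    number of vertices, given their regularized Laplacians L1, L2."""
    n = L1.shape[0]
    S1 = np.linalg.inv(L1) + gamma * np.eye(n)
    S2 = np.linalg.inv(L2) + gamma * np.eye(n)
    M = np.linalg.inv(0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2))
    # |M|^(1/2) / (|S1|^(1/4) |S2|^(1/4)), computed via log-determinants
    _, ld_M = np.linalg.slogdet(M)
    _, ld_1 = np.linalg.slogdet(S1)
    _, ld_2 = np.linalg.slogdet(S2)
    return np.exp(0.5 * ld_M - 0.25 * ld_1 - 0.25 * ld_2)
```

As a sanity check, k_LG(G, G) = 1 for any graph, since then the averaged matrix M coincides with S1 = S2 and the determinants cancel.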
Our goal for the rest of this paper is to overcome each of these limitations, while retaining the LG kernel's attractive spectral interpretation.

2.1 The feature space Laplacian graph kernel (FLG kernel)

In the probabilistic view of the LG kernel, every graph generates random vectors x = (x1, . . . , xn)⊤ according to (1), and the kernel between two graphs is determined by comparing the corresponding distributions. The invariance problem arises because the ordering of the variables x1, . . . , xn is arbitrary: even if G1 and G2 are topologically the same, k_LG(G1, G2) might be low if their vertices happen to be numbered differently.
One of the central ideas of this paper is to address this issue by transforming from the "vertex space variables" x1, . . . , xn to "feature space variables" y1, . . . , ym, where yi = Σ_j t_{i,j}(x_j), and each t_{i,j} function may only depend on j through local and reordering invariant properties of vertex vj. If we then compute an analogous kernel to the LG kernel, but now between the distributions of the y's rather than the x's, the resulting kernel will be permutation invariant.
In the simplest case, the t_{i,j} functions are linear, i.e., t_{i,j}(x_j) = φi(vj) · x_j, where (φ1, . . . , φm) is a collection of m local (and permutation invariant) vertex features. For example, φi(vj) may be the degree of vertex vj, or the value of h_β(vj, vj), where h is the diffusion kernel on G with length scale parameter β (c.f., [16]). In the chemoinformatics setting, the φi's might be some way of encoding what type of atom is located at vertex vj.
The linear transform of a multivariate normal random variable is multivariate normal.
In our case, defining U_{i,j} = φi(vj) and y = U x, we have E(y) = 0 and Cov(y, y) = U Cov(x, x) U⊤ = U L^{−1} U⊤, leading to the following kernel, which is the workhorse of the present paper.
Definition 2. Let G1 and G2 be two graphs with regularized Laplacians L1 and L2, respectively, γ ≥ 0 a parameter, and (φ1, . . . , φm) a collection of m local vertex features. Define the corresponding feature mapping matrices

    [U1]_{i,j} = φi(vj),    [U2]_{i,j} = φi(v'j),

where vj is the j'th vertex of G1 and v'j is the j'th vertex of G2. The corresponding Feature space Laplacian graph kernel (FLG kernel) is defined as

    k_FLG(G1, G2) = |((1/2) S1^{−1} + (1/2) S2^{−1})^{−1}|^{1/2} / ( |S1|^{1/4} |S2|^{1/4} ),    (4)

where S1 = U1 L1^{−1} U1⊤ + γI and S2 = U2 L2^{−1} U2⊤ + γI.
Since the φ1, . . . , φm vertex features, by definition, are local and invariant to vertex renumbering, the FLG kernel is permutation invariant. Moreover, the distributions now live in the space of features rather than the space defined by the vertices, so we can apply the kernel to two graphs with different numbers of vertices. The major remaining shortcoming of the FLG kernel is that it cannot take into account structure at multiple different scales.

2.2 The "kernelized" FLG kernel

The key to boosting k_FLG to a multiscale kernel is that it itself can be "kernelized", i.e., it can be computed from just the inner products between the feature vectors of the vertices (which we call the base kernel) without having to know the actual φi(vj) feature values.
Definition 3. Given a collection φ = (φ1, . . . , φm)⊤ of local vertex features, we define the corresponding base kernel κ between two vertices v and v' as the dot product of their feature vectors: κ(v, v') = φ(v)⊤ φ(v').
Note that in this definition v and v' may be two vertices of the same graph, or of two different graphs. We first show that, similarly to other kernel methods [17], to compute k_FLG(G1, G2) one only needs to consider the subspace of R^m spanned by the feature vectors of their vertices.
Proposition 1. Let G1 and G2 be two graphs with vertex sets V1 = {v1, . . . , vn1} and V2 = {v'1, . . . , v'n2}, and let {ξ1, . . . , ξp} be an orthonormal basis for the subspace

    W = span{ φ(v1), . . . , φ(vn1), φ(v'1), . . . , φ(v'n2) },

with dim(W) = p. Then, (4) can be rewritten as

    k_FLG(G1, G2) = |((1/2) S̄1^{−1} + (1/2) S̄2^{−1})^{−1}|^{1/2} / ( |S̄1|^{1/4} |S̄2|^{1/4} ),    (5)

where [S̄1]_{i,j} = ξi⊤ S1 ξj and [S̄2]_{i,j} = ξi⊤ S2 ξj. In other words, S̄1 and S̄2 are the projections of S1 and S2 to W.
Similarly to kernel PCA [18] or the Bhattacharyya kernel [15], the easiest way to construct the basis {ξ1, . . . , ξp} required by (5) is to compute the eigendecomposition of the joint Gram matrix of the vertices of the two graphs.
Proposition 2. Let G1 and G2 be as in Proposition 1, let V = {v1, . . . , vn1+n2} be the union of their vertex sets (where it is assumed that the first n1 vertices are {v1, . . . , vn1} and the second n2 vertices are {v'1, . . . , v'n2}), and define the joint Gram matrix K ∈ R^{(n1+n2)×(n1+n2)} as

    K_{i,j} = κ(vi, vj) = φ(vi)⊤ φ(vj).

Let u1, . . . , up be a maximal orthonormal set of the non-zero eigenvalue eigenvectors of K with corresponding eigenvalues λ1, . . . , λp. Then the vectors

    ξi = (1/√λi) Σ_{ℓ=1}^{n1+n2} [ui]_ℓ φ(vℓ)    (6)

form an orthonormal basis for W. Moreover, defining Q = [λ1^{1/2} u1, . . . , λp^{1/2} up] ∈ R^{(n1+n2)×p} and setting Q1 = Q_{1:n1, :} and Q2 = Q_{n1+1:n1+n2, :} (the first n1 and the remaining n2 rows of Q, respectively), the matrices S̄1 and S̄2 appearing in (5) can be computed as

    S̄1 = Q1⊤ L1^{−1} Q1 + γI,    S̄2 = Q2⊤ L2^{−1} Q2 + γI.    (7)

Proofs of these two propositions are given in the Supplemental Material. As in other kernel methods, the significance of Propositions 1 and 2 is not just that they show how k_FLG(G1, G2) can be efficiently computed when φ is very high dimensional, but that they make it clear that the FLG kernel may be induced from any base kernel. For completeness, we close this section with the generalized definition of the FLG kernel.
Definition 4. Let G1 and G2 be two graphs. Assume that each of their vertices comes from an abstract vertex space V and that κ : V × V → R is a symmetric positive semi-definite kernel on V. The generalized FLG kernel induced from κ is then defined as

    k^κ_FLG(G1, G2) = |((1/2) S̄1^{−1} + (1/2) S̄2^{−1})^{−1}|^{1/2} / ( |S̄1|^{1/4} |S̄2|^{1/4} ),    (8)

where S̄1 and S̄2 are as defined in Proposition 2.

3 The multiscale Laplacian graph kernel (MLG kernel)

By multiscale graph kernel we mean a kernel that is able to capture similarity between graphs not just based on the topological relationships between their individual vertices, but also the topological relationships between subgraphs. The key property of the FLG kernel that allows us to build such a kernel is that it can be applied recursively.
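Before describing the recursion, it may help to see how Propositions 1 and 2 translate into code: given the joint Gram matrix of the two graphs' vertices, a single eigendecomposition yields the projected matrices and the kernel value. This is a simplified sketch in our own notation, not the authors' implementation:

```python
import numpy as np

def flg_kernel_from_base(K, L1, L2, gamma=0.1):
    """FLG kernel induced from a base kernel (Propositions 1 and 2).

    K      : (n1+n2) x (n1+n2) joint Gram matrix of the vertices of the
             two graphs, stacked with graph 1's vertices first
    L1, L2 : regularized Laplacians of the two graphs
    """
    n1 = L1.shape[0]
    lam, U = np.linalg.eigh(K)
    keep = lam > 1e-10                      # non-zero eigenvalue eigenvectors
    Q = U[:, keep] * np.sqrt(lam[keep])     # Q = [sqrt(l1) u1, sqrt(l2) u2, ...]
    Q1, Q2 = Q[:n1], Q[n1:]
    p = Q.shape[1]
    S1 = Q1.T @ np.linalg.inv(L1) @ Q1 + gamma * np.eye(p)
    S2 = Q2.T @ np.linalg.inv(L2) @ Q2 + gamma * np.eye(p)
    M = np.linalg.inv(0.5 * np.linalg.inv(S1) + 0.5 * np.linalg.inv(S2))
    _, ld_M = np.linalg.slogdet(M)
    _, ld_1 = np.linalg.slogdet(S1)
    _, ld_2 = np.linalg.slogdet(S2)
    return np.exp(0.5 * ld_M - 0.25 * ld_1 - 0.25 * ld_2)
```

Note that nothing in this computation requires the explicit feature vectors, only the Gram matrix K, which is what makes the recursive construction below possible.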
In broad terms, the construction goes as follows:
1. Given a graph G, associate each vertex with a subgraph centered around it and compute the FLG kernel between every pair of these subgraphs.
2. Reinterpret the FLG kernel between these subgraphs as a new base kernel between the center vertices of the subgraphs.
3. Consider larger subgraphs centered at each vertex, compute the FLG kernel between them induced from the new base kernel constructed in the previous step, and recurse.
To compute the actual multiscale graph kernel K between G and another graph G', we follow the same process for G' and then set K(G, G') equal to the FLG kernel induced from their top level base kernels. The following definitions formalize this construction.
Definition 5. Let G be a graph with vertex set V, and κ a positive semi-definite kernel on V. Assume that for each v ∈ V we have a nested sequence of L neighborhoods

    v ∈ N1(v) ⊆ N2(v) ⊆ . . . ⊆ NL(v) ⊆ V,    (9)

and for each Nℓ(v), let Gℓ(v) be the corresponding induced subgraph of G. We define the Multiscale Laplacian Subgraph Kernels (MLS kernels), K1, . . . , KL : V × V → R, as follows:
1. K1 is just the FLG kernel k^κ_FLG induced from the base kernel κ between the lowest level subgraphs:

    K1(v, v') = k^κ_FLG(G1(v), G1(v')).

2. For ℓ = 2, 3, . . . , L, Kℓ is the FLG kernel induced from Kℓ−1 between Gℓ(v) and Gℓ(v'):

    Kℓ(v, v') = k^{Kℓ−1}_FLG(Gℓ(v), Gℓ(v')).

Definition 5 defines the MLS kernel as a kernel between different subgraphs of the same graph G. However, if two graphs G1 and G2 share the same base kernel, the MLS kernel can also be used to compare any subgraph of G1 with any subgraph of G2. This is what allows us to define an (L+1)'th FLG kernel, which compares the two full graphs.
Definition 6. Let G be a collection of graphs such that all their vertices are members of an abstract vertex space V endowed with a symmetric positive semi-definite kernel κ : V × V → R. Assume that the MLS kernels K1, . . . , KL are defined as in Definition 5, both for pairs of subgraphs within the same graph and across pairs of different graphs. We define the Multiscale Laplacian Graph Kernel (MLG kernel) between any two graphs G1, G2 ∈ G as

    K(G1, G2) = k^{KL}_FLG(G1, G2).

Definition 5 leaves open the question of how the neighborhoods N1(v), . . . , NL(v) are to be defined. In the simplest case, we set Nℓ(v) to be the ball B_r(v) (i.e., the set of vertices at a distance at most r from v), where r = r0 d^{ℓ−1} for some d > 1.

3.1 Computational complexity

Definitions 5 and 6 suggest a recursive approach to computing the MLG kernel: computing K(G1, G2) first requires computing KL(v, v') between all (n1+n2 choose 2) pairs of top level subgraphs across G1 and G2; each of these kernel evaluations requires computing KL−1(v, v') between up to O(n²) level L−1 subgraphs, and so on. Following this recursion blindly would require up to O(n^{2L+2}) kernel evaluations, which is clearly infeasible.
The recursive strategy is wasteful because it involves evaluating the same kernel entries over and over again in different parts of the recursion tree. An alternative solution that requires only O(Ln²) kernel evaluations would be to first compute K1(v, v') for all (v, v') pairs, then compute K2(v, v') for all (v, v') pairs, and so on.

4 Linearized Kernels and Low Rank Approximation

Computing the MLG kernel between two graphs, as described in the previous section, may involve O(Ln²) kernel evaluations.
At the top levels of the hierarchy each Gℓ(v) might have Θ(n) vertices, so the cost of a single FLG kernel evaluation can be as high as O(n³). Somewhat pessimistically, this means that the overall cost of computing k_FLG(G1, G2) is O(Ln⁵). Given a dataset of M graphs, computing their Gram matrix requires repeating this for all {G1, G2} pairs, giving O(LM²n⁵), which is even more problematic. The solution that we propose in this section is to compute for each level ℓ = 1, 2, . . . , L + 1 a single joint basis for all subgraphs at the given level across all graphs G1, . . . , GM. For concreteness, we go back to the definition of the FLG kernel.
Definition 7. Let G = {G1, . . . , GM} be a collection of graphs, V1, . . . , VM their vertex sets, and assume that V1, . . . , VM ⊆ V for some general vertex space V. Further, assume that κ : V × V → R is a positive semi-definite kernel on V, H_κ is its Reproducing Kernel Hilbert Space, and φ : V → H_κ is the corresponding feature map satisfying κ(v, v') = ⟨φ(v), φ(v')⟩ for any v, v' ∈ V. The joint vertex feature space of {G1, . . . , GM} is then

    W_G = span{ ∪_{i=1}^{M} ∪_{v ∈ Vi} {φ(v)} }.

W_G is just the generalization of the W space defined in Proposition 1 from two graphs to M. The following generalization of Propositions 1 and 2 is then immediate.
Proposition 3. Let N = Σ_{i=1}^{M} |Vi|, let V = (v1, . . . , vN) be the concatenation of the vertex sets V1, . . . , VM, and let K be the corresponding joint Gram matrix K_{i,j} = κ(vi, vj) = ⟨φ(vi), φ(vj)⟩. Let u1, . . . , uP be a maximal orthonormal set of non-zero eigenvalue eigenvectors of K with corresponding eigenvalues λ1, . . . , λP, where P = dim(W_G). Then the vectors

    ξi = (1/√λi) Σ_{ℓ=1}^{N} [ui]_ℓ φ(vℓ),    i = 1, . . . , P,

form an orthonormal basis for W_G. Moreover, defining Q = [λ1^{1/2} u1, . . . , λP^{1/2} uP] ∈ R^{N×P}, and setting Q1 to be the submatrix of Q composed of its first |V1| rows, Q2 the submatrix composed of the next |V2| rows, and so on, for any Gi, Gj ∈ G the generalized FLG kernel induced from κ (Definition 4) can be expressed as

    k_FLG(Gi, Gj) = |((1/2) S̄i^{−1} + (1/2) S̄j^{−1})^{−1}|^{1/2} / ( |S̄i|^{1/4} |S̄j|^{1/4} ),    (10)

where S̄i = Qi⊤ Li^{−1} Qi + γI and S̄j = Qj⊤ Lj^{−1} Qj + γI.
The significance of Proposition 3 is that S̄1, . . . , S̄M are now fixed matrices that do not need to be recomputed for each kernel evaluation. Once we have constructed the joint basis {ξ1, . . . , ξP}, the S̄i matrix of each graph Gi can be computed independently, as a precomputation step, and individual kernel evaluations reduce to just plugging them into (10). At a conceptual level, Proposition 3 linearizes the kernel κ by projecting everything down to W_G. In particular, it replaces the {φ(vi)} RKHS vectors with explicit finite dimensional feature vectors given by the corresponding rows of Q, just like we had in the "unkernelized" FLG kernel of Definition 2.
For our multiscale kernels this is particularly important, because linearizing not just k^κ_FLG, but also k^{K1}_FLG, k^{K2}_FLG, . . . , allows us to compute the MLG kernel level by level, without recursion.
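The level-by-level schedule might be organized as in the following skeleton. This is our own sketch, intended only to show the O(Ln²) control flow: the FLG evaluation itself is abstracted behind a callable `flg` (which would encapsulate the linearized computation above), so this is not the full algorithm:

```python
import numpy as np

def mls_kernels_level_by_level(vertices, neighborhoods, base_K, flg, L):
    """Compute the top-level MLS kernel matrix K_L without recursion.

    Instead of recursing (up to O(n^{2L+2}) kernel evaluations), compute
    the full matrix K_l between all vertex pairs before moving on to
    level l+1, for a total of O(L n^2) FLG evaluations.

    neighborhoods[l][v] : vertex set N_{l+1}(v) (nested, growing with l)
    base_K              : Gram matrix of the base kernel on the vertices
    flg(K_prev, S, S2)  : FLG kernel between the subgraphs induced by
                          vertex sets S and S2, given the previous
                          level's kernel matrix (abstracted here)
    """
    K_prev = base_K
    for l in range(L):
        n = len(vertices)
        K_l = np.empty((n, n))
        for i in range(n):
            for j in range(i, n):       # kernels are symmetric
                K_l[i, j] = K_l[j, i] = flg(
                    K_prev, neighborhoods[l][i], neighborhoods[l][j])
        K_prev = K_l                    # becomes the next level's base kernel
    return K_prev
```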
After linearizing the base kernel κ, we attach explicit, finite dimensional feature vectors to each vertex of each graph. Then we compute k^{K1}_FLG between all pairs of lowest level subgraphs, and linearizing this kernel as well, each vertex effectively just gets an updated feature vector. We repeat the process for k^{K2}_FLG, . . . , k^{KL}_FLG, and finally we compute the MLG kernel K(G1, G2).

4.1 Randomized low rank approximation

The difficulty in the above approach, of course, is that at each level K is a Gram matrix between all vertices of all graphs, so storing it is already very costly, let alone computing its eigendecomposition. Moreover, P = dim(W_G) is also very large, so managing the S̄1, . . . , S̄M matrices (each of which is of size P × P) becomes infeasible. The natural alternative is to replace W_G by a smaller, approximate joint feature space, defined as follows.
Definition 8. Let G, κ, H_κ and φ be defined as in Definition 7. Let Ṽ = (ṽ1, . . . , ṽ_Ñ) be Ñ ≪ N vertices sampled from the joint vertex set V = (v1, . . . , vN). Then the corresponding subsampled vertex feature space is

    W̃_G = span{ φ(ṽ) | ṽ ∈ Ṽ }.

Let P̃ = dim(W̃_G). Similarly to before, we construct an orthonormal basis {ξ1, . . . , ξ_P̃} for W̃_G by forming the (now much smaller) Gram matrix K̃_{i,j} = κ(ṽi, ṽj), computing its eigenvalues and eigenvectors, and setting ξi = (1/√λi) Σ_{ℓ=1}^{Ñ} [ui]_ℓ φ(ṽℓ).
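This subsampling step can be sketched as follows (our own naming, illustrative only): form the small Gram matrix K̃ on the Ñ sampled vertices, keep the leading eigenpairs, and project any vertex via its cross-kernel values κ(v, ṽℓ):

```python
import numpy as np

def subsampled_basis(K_tilde, p_hat):
    """Top-p_hat eigenbasis of the subsampled Gram matrix K_tilde
    (kernel PCA on the sampled vertices); eigenpairs are returned
    ordered by descending eigenvalue."""
    lam, U = np.linalg.eigh(K_tilde)          # eigh returns ascending order
    order = np.argsort(lam)[::-1][:p_hat]
    return lam[order], U[:, order]

def project_vertices(K_cross, lam, U):
    """Rows of Q~: coordinates of each vertex's feature vector in the
    subsampled basis.  K_cross[j, l] = kappa(v_j, v~_l)."""
    return K_cross @ U / np.sqrt(lam)
```

When all eigenpairs are kept and the sampled vertices are projected onto their own basis, the projection exactly reproduces K̃, which is the sense in which this mirrors a Nyström-type approximation.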
The resulting approximate FLG kernel is

    k_FLG(Gi, Gj) = |((1/2) S̃i^{−1} + (1/2) S̃j^{−1})^{−1}|^{1/2} / ( |S̃i|^{1/4} |S̃j|^{1/4} ),    (11)

where S̃i = Q̃i⊤ Li^{−1} Q̃i + γI and S̃j = Q̃j⊤ Lj^{−1} Q̃j + γI are the projections of S̄i and S̄j to W̃_G.
We introduce a further layer of approximation by restricting W̃_G to the space spanned by the first P̂ < P̃ basis vectors (ordered by descending eigenvalue), effectively doing kernel PCA on {φ(ṽ)}_{ṽ ∈ Ṽ}, or equivalently, a low rank approximation of K̃. Assuming that v^g_j is the j'th vertex of Gg, in contrast to Proposition 2, the j'th row of Q̃g now consists of the coordinates of the projection of φ(v^g_j) onto W̃_G, i.e.,

    [Q̃g]_{j,i} = (1/√λi) Σ_{ℓ=1}^{Ñ} [ui]_ℓ ⟨φ(v^g_j), φ(ṽℓ)⟩ = (1/√λi) Σ_{ℓ=1}^{Ñ} [ui]_ℓ κ(v^g_j, ṽℓ).

The above procedure is similar to the popular Nyström approximation for kernel matrices [19, 20], except that in our case the ultimate goal is not to approximate the joint Gram matrix itself, but the S̄1, . . . , S̄M matrices used to form the FLG kernel. In practice, we found that the eigenvalues of K usually drop off very rapidly, suggesting that W can be safely approximated by a surprisingly small dimensional subspace (P̂ ∼ 10), and correspondingly the sample size Ñ can be kept quite small as well (on the order of 100). The combination of these two factors makes computing the entire stack of kernels feasible, reducing the complexity of computing the Gram matrix for a dataset of M graphs of Θ(n) vertices each to O(MLÑ²P̂³ + MLÑ³ + M²P̂³). It is also important to note that this linearization step requires the graphs (not the labels) in the test set to be known during training in order to project the features of the test graphs onto the low rank approximation of W̃_G.

5 Experiments

We tested the efficacy of the MLG kernel by performing classification on benchmark bioinformatics datasets using a binary C-SVM solver [21], and compared our classification results against those from other representative graph kernels from the literature: the Weisfeiler–Lehman Kernel, the Weisfeiler–Lehman Edge Kernel [9], the Shortest Path Kernel [6], the Graphlet Kernel [9], and the p-random Walk Kernel [5].

Table 1: Classification Results (Mean Accuracy ± Standard Deviation)

    Method    MUTAG[22]     PTC[23]       ENZYMES[2]    PROTEINS[2]   NCI1[24]      NCI109[24]
    WL        84.50(±2.16)  59.97(±1.60)  53.75(±1.37)  75.43(±1.95)  84.76(±0.32)  85.12(±0.29)
    WL-Edge   82.94(±2.33)  60.18(±2.19)  52.00(±0.72)  73.63(±2.12)  84.65(±0.25)  85.32(±0.34)
    SP        85.50(±2.50)  59.53(±1.71)  42.31(±1.37)  75.61(±0.45)  73.61(±0.36)  73.23(±0.26)
    Graphlet  82.44(±1.29)  55.88(±0.31)  30.95(±0.73)  71.63(±0.33)  62.40(±0.27)  62.35(±0.28)
    p-RW      80.33(±1.35)  59.85(±0.95)  28.17(±0.76)  71.67(±0.78)  TIMED OUT     TIMED OUT
    MLG       84.21(±2.61)  63.62(±4.69)  57.92(±5.39)  76.14(±1.95)  80.83(±1.29)  81.30(±0.80)

We randomly selected 20% of each dataset to be used as a test set.
On the other 80% we performed 10-fold cross validation to select the parameters of each kernel method to be used on the test set, and we repeated this whole setup 10 times. For the Weisfeiler–Lehman kernels, the height parameter h was chosen from {1, 2, ..., 5}; the walk length p for the p-random walk kernel was chosen from {1, 2, ..., 5}; and for the graphlet kernel the graphlet size n was chosen from {3, 4, 5}. For the MLG kernel, we chose η from {0.01, 0.1, 1}, the radius n from {1, 2, 3}, and the number of levels l from {1, 2, 3}, and fixed γ at 0.01. For the MLG kernel, we used the given discrete node labels to create a one-hot binary feature vector for each node, and took the dot product between the nodes' binary feature vectors as the base kernel. All experiments were run on a 16-core Intel E5-2670 @ 2.6GHz processor with 32 GB of memory.
The MLG kernel is fairly competitive in accuracy on all datasets except NCI1 and NCI109, where it nevertheless performs better than all non-Weisfeiler–Lehman kernels. The Supplementary Materials give a more detailed discussion of the experiments and datasets.

6 Conclusions

In this paper we have proposed two new graph kernels: (1) the FLG kernel, a very simple single level kernel that combines information attached to the vertices with the graph Laplacian; and (2) the MLG kernel, a multilevel, recursively defined kernel that captures topological relationships not just between individual vertices, but also between subgraphs. Clearly, designing kernels that optimally take into account the multiscale structure of actual chemical compounds is a challenging task that will require further work and domain knowledge. However, it is encouraging that even "straight out of the box", tuning only two or three parameters, the MLG kernel is competitive with other well known kernels in the literature.
Beyond graphs, the general idea of multiscale kernels is of interest for other types of data that have multiresolution structure (such as images), and the way the MLG kernel chains together local spectral analysis at multiple scales is potentially applicable to these domains as well; this will be the subject of future research.

Acknowledgements
This work was completed in part with computing resources provided by the University of Chicago Research Computing Center and with the support of DARPA-D16AP00112 and NSF-1320344.

References
[1] Akihiro Inokuchi, Takashi Washio, and Hiroshi Motoda. Complete mining of frequent patterns from graphs: Mining graph data. Machine Learning, 50(3):321–354, 2003.

[2] K. M. Borgwardt, C. S. Ong, S. Schönauer, S. V. N. Vishwanathan, A. J. Smola, and H.-P. Kriegel. Protein function prediction via graph kernels. In Proceedings of Intelligent Systems in Molecular Biology (ISMB), Detroit, USA, 2005.

[3] H. Kubinyi. Drug research: myths, hype and reality. Nature Reviews: Drug Discovery, 2(8):665–668, August 2003.

[4] T. Gärtner. Exponential and geometric kernels for graphs. In NIPS*02 workshop on unreal data, volume Principles of modeling nonvectorial data, 2002.

[5] S. V. N. Vishwanathan, Karsten Borgwardt, Risi Kondor, and Nicol Schraudolph. On graph kernels. Journal of Machine Learning Research (JMLR), 11, 2010.

[6] Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 27–30 November 2005, Houston, Texas, USA, pages 74–81, 2005.

[7] Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, and Karsten M. Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, 2013.

[8] Risi Kondor and Karsten Borgwardt. The skew spectrum of graphs.
In Proceedings of the International Conference on Machine Learning (ICML), pages 496–503. ACM, 2008.

[9] Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 488–495, 2009.

[10] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler–Lehman graph kernels. Journal of Machine Learning Research (JMLR), 12:2539–2561, November 2011.

[11] Marion Neumann, Roman Garnett, Christian Bauckhage, and Kristian Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 2016.

[12] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In Proceedings of the International Conference on Machine Learning (ICML), 2016.

[13] Fredrik D. Johansson and Devdatt Dubhashi. Learning with similarity functions on graphs using matchings of geometric embeddings. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 467–476, 2015.

[14] Tony Jebara and Risi Kondor. Bhattacharyya and expected likelihood kernels. In Proceedings of the Annual Conference on Computational Learning Theory and Kernels Workshop (COLT/KW), 2003.

[15] Risi Kondor and Tony Jebara. A kernel between sets of vectors. In Proceedings of the International Conference on Machine Learning (ICML), 2003.

[16] Marc Alexa, Michael Kazhdan, and Leonidas Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In Proceedings of the Eurographics Symposium on Geometry Processing, volume 28, 2009.

[17] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

[18] S. Mika, B.
Schölkopf, A. J. Smola, K.-R. Müller, Matthias Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems 11, pages 536–542, 1999.

[19] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.

[20] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

[21] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 3, 2011.

[22] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med. Chem., 34:786–797, 1991.

[23] H. Toivonen, A. Srinivasan, R. D. King, S. Kramer, and C. Helma. Statistical evaluation of the predictive toxicology challenge. Bioinformatics, pages 1183–1193, 2003.

[24] N. Wale, I. A. Watson, and G. Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, pages 347–375, 2008.