{"title": "RetGK: Graph Kernels based on Return Probabilities of Random Walks", "book": "Advances in Neural Information Processing Systems", "page_first": 3964, "page_last": 3974, "abstract": "Graph-structured data arise in wide applications, such as computer vision, bioinformatics, and social networks. Quantifying similarities among graphs is a fundamental problem. In this paper, we develop a framework for computing graph kernels, based on return probabilities of random walks. The advantages of our proposed kernels are that they can effectively exploit various node attributes, while being scalable to large datasets. We conduct extensive graph classification experiments to evaluate our graph kernels. The experimental results show that our graph kernels significantly outperform other state-of-the-art approaches in both accuracy and computational efficiency.", "full_text": "RetGK: Graph Kernels based on Return Probabilities\n\nof Random Walks\n\nZhen Zhang, Mianzhi Wang, Yijian Xiang, Yan Huang, and Arye Nehorai\n\nDepartment of Electrical and Systems Engineering\n\nWashington University in St. Louis\n\nSt. Louis, MO 63130\n\n{zhen.zhang, mianzhi.wang, yijian.xiang, yanhuang640, nehorai}@wustl.edu\n\nAbstract\n\nGraph-structured data arise in wide applications, such as computer vision, bioinfor-\nmatics, and social networks. Quantifying similarities among graphs is a fundamen-\ntal problem. In this paper, we develop a framework for computing graph kernels,\nbased on return probabilities of random walks. The advantages of our proposed\nkernels are that they can effectively exploit various node attributes, while being\nscalable to large datasets. We conduct extensive graph classi\ufb01cation experiments to\nevaluate our graph kernels. 
The experimental results show that our graph kernels\nsigni\ufb01cantly outperform existing state-of-the-art approaches in both accuracy and\ncomputational ef\ufb01ciency.\n\n1\n\nIntroduction\n\nStructured data modeled as graphs arise in many application domains, such as computer vision,\nbioinformatics, and social network mining. One interesting problem for graph-type data is quantifying\ntheir similarities based on the connectivity structure and attribute information. Graph kernels, which\nare positive de\ufb01nite functions on graphs, are powerful similarity measures, in the sense that they\nmake various kernel-based learning algorithms, for example, clustering, classi\ufb01cation, and regression,\napplicable to structured data. For instance, it is possible to classify proteins by predicting whether a\ngiven protein is an enzyme or not.\nThere are several technical challenges in developing effective graph kernels: (i) When designing graph\nkernels, one might come across the graph isomorphism problem, a well-known NP problem. The\nkernels should satisfy the isomorphism-invariant property, while being informative on the topological\nstructure difference. (ii) Graphs are usually coupled with multiple types of node attributes, e.g.,\ndiscrete1 or continuous attributes. For example, a chemical compound may have both discrete and\ncontinuous attributes, which respectively describe the type and position of atoms. A crucial problem\nis how to integrate the graph structure and node attribute information in graph kernels. (iii) In some\napplications, e.g., social networks, graphs tend to be very large, with thousands or even millions of\nnodes, which requires strongly scalable graph kernels.\nIn this work, we propose novel methods to overcome these challenges. We revisit the concept\nof random walks, introducing a new node structural role descriptor, the return probability feature\n(RPF). 
We rigorously show that the RPF is isomorphism-invariant and encodes very rich connectivity information. Moreover, the RPF allows us to consider attributed and nonattributed graphs in a unified framework. With the RPF, we can embed (non-)attributed graphs into a Hilbert space. After that, we naturally obtain our return probability-based graph kernels ("RetGK" for short). Combined with the approximate feature map technique, we represent each graph with a multi-dimensional tensor and design a family of computationally efficient graph kernels.

¹In the literature, the discrete node attributes are usually called "labels".

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Related work. There are various graph kernels, many of which explore the R-convolution framework [12]. The key idea is decomposing a whole graph into small substructures and building graph kernels based on the similarities among these components. Such kernels differ from each other in the way they decompose graphs. For example, graphlet kernels [26] are based on small subgraphs up to a fixed size. Weisfeiler-Lehman graph kernels [25] and tree-based kernels [6] are developed with subtree patterns. Shortest path kernels [1] are derived by comparing the paths between graphs. Still other graph kernels, such as [30] and [10], are developed by counting the number of common random walks on direct product graphs. Recently, subgraph matching kernels [18] and graph invariant kernels [22] were proposed for handling continuous attributes. However, all the above R-convolution based graph kernels suffer from a drawback. As pointed out in [32], increasing the size of the substructures will largely decrease the probability that two graphs contain similar substructures, which usually results in the "diagonal dominance issue" [14]. Our return probability based kernels are significantly different from the above ones. 
We measure the similarity between two graphs by directly comparing their node structural role distributions, avoiding substructure decomposition.
More recently, new methods have been proposed that compare graphs by quantifying the dissimilarity between the distributions of pairwise distances between nodes: [24] uses the shortest path distance, and [29] uses the diffusion distance. However, these methods apply only to non-attributed (unlabeled) graphs, which largely limits their applications in the real world.
Organization. In Section 2, we introduce the necessary background, including graph concepts and tensor algebra. In Section 3, we discuss the favorable properties of, and computational methods for, the RPF. In Section 4, we present the Hilbert space embedding of graphs and develop the corresponding graph kernels. In Section 5, we show the tensor representation of graphs and derive computationally efficient graph kernels. In Section 6, we report the experimental results on 21 benchmark datasets. In the supplementary material, we provide proofs of all mathematical results in the paper.

2 Background

2.1 Graph concepts

An undirected graph $G$ consists of a set of nodes $V_G = \{v_1, v_2, ..., v_n\}$ and a set of edges $E_G \subseteq V_G \times V_G$. Each edge $(v_i, v_j)$ is assigned a positive value $w_{ij}$ describing the connection strength between $v_i$ and $v_j$. For an unweighted graph, all the edge weights are set to one, i.e., $w_{ij} = 1, \forall (v_i, v_j) \in E_G$. Two graphs $G$ and $H$ are isomorphic if there exists a permutation map $\tau : V_G \to V_H$ such that $\forall (v_i, v_j) \in E_G$, $(\tau(v_i), \tau(v_j)) \in E_H$, and the corresponding edge weights are preserved.
The adjacency matrix $A_G$ is an $n \times n$ symmetric matrix with $A_G(i, j) = w_{ij}$. The degree matrix $D_G$ is a diagonal matrix whose diagonal terms are $D_G(i, i) = \sum_{(v_i, v_j) \in E_G} w_{ij}$. The volume of $G$ is the summation of all node degrees, i.e., $\mathrm{Vol}_G = \sum_{i=1}^{n} D_G(i, i)$. An $S$-step walk starting from node $v_0$ is a sequence of nodes $\{v_0, v_1, v_2, ..., v_S\}$ with $(v_s, v_{s+1}) \in E_G$, $0 \le s \le S - 1$. A random walk on $G$ is a Markov chain $(X_0, X_1, X_2, ...)$ whose transition probabilities are

$$\Pr(X_{i+1} = v_{i+1} \mid X_i = v_i, ..., X_0 = v_0) = \Pr(X_{i+1} = v_{i+1} \mid X_i = v_i) = \frac{w_{ij}}{D_G(i, i)}, \quad (1)$$

which induces the transition probability matrix $P_G = D_G^{-1} A_G$. More generally, $P_G^s$ is the $s$-step transition matrix, where $P_G^s(i, j)$ is the transition probability in $s$ steps from node $v_i$ to node $v_j$.
In our paper, we also consider the case that nodes are associated with multiple attributes. Let $\mathcal{A}$ denote an attribute domain. Typically, $\mathcal{A}$ can be an alphabet set or a subset of a Euclidean space, which correspond to discrete attributes and continuous attributes, respectively.

2.2 Tensor algebra

A tensor [17] is a multidimensional array, which has multiple indices.² We use $\mathbb{R}^{I_1 \times I_2 \times ... \times I_N}$ to denote the set of tensors of order $N$ with dimension $(I_1, I_2, ..., I_N)$. If $U \in \mathbb{R}^{I_1 \times I_2 \times ... \times I_N}$, then $U_{i_1 i_2 ... i_N} \in \mathbb{R}$, where $1 \le i_1 \le I_1, ..., 1 \le i_N \le I_N$.

²A vector $\vec{u} \in \mathbb{R}^D$ is a first-order tensor, and a matrix $A \in \mathbb{R}^{D_1 \times D_2}$ is a second-order tensor.

The inner product between tensors $U, V \in \mathbb{R}^{I_1 \times I_2 \times ... \times I_N}$ is defined such that

$$\langle U, V \rangle_T = \mathrm{vec}(U)^T \mathrm{vec}(V) = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} U_{i_1 i_2 ... i_N} V_{i_1 i_2 ... i_N}. \quad (2)$$

A rank-one tensor $W \in \mathbb{R}^{I_1 \times I_2 \times ... \times I_N}$ is the tensor (outer) product of $N$ vectors, i.e., $W = \vec{w}^{(1)} \circ \vec{w}^{(2)} \circ \cdots \circ \vec{w}^{(N)}$, with $W_{i_1 i_2 ... i_N} = \vec{w}^{(1)}_{i_1} \vec{w}^{(2)}_{i_2} \cdots \vec{w}^{(N)}_{i_N}$.

3 Return probabilities of random walks

Given a graph $G$, as we can see from (1), the transition probability matrix $P_G$ encodes all the connectivity information, which leads to a natural intuition: we can compare two graphs by quantifying the difference between their transition probability matrices. However, big technical difficulties exist, since the sizes of the two matrices are not necessarily the same, and their rows or columns do not correspond in most cases.
To tackle the above issues, we make use of the $S$-step return probabilities of random walks on $G$. To do this, we assign each node $v_i \in V_G$ an $S$-dimensional feature called the "return probability feature" ("RPF" for short), which describes the "structural role" of $v_i$, i.e.,

$$\vec{p}_i = [P^1_G(i, i), P^2_G(i, i), ..., P^S_G(i, i)]^T, \quad (3)$$

where $P^s_G(i, i)$, $s = 1, 2, ..., S$, is the return probability of an $s$-step random walk starting from $v_i$. Now each graph is represented by a set of feature vectors in $\mathbb{R}^S$: $\mathrm{RPF}^S_G = \{\vec{p}_1, \vec{p}_2, ..., \vec{p}_n\}$. The RPF has three nice properties: isomorphism-invariance, multi-resolution, and informativeness.

3.1 The properties of RPF

Isomorphism-invariance. The isomorphism-invariance property of return probability features is summarized in the following proposition.
Proposition 1. Let $G$ and $H$ be two isomorphic graphs of $n$ nodes, and let $\tau : \{1, 2, ..., n\} \to \{1, 2, ..., n\}$ be the corresponding isomorphism. Then,

$$\forall v_i \in V_G, \ s = 1, 2, ..., \infty, \quad P^s_G(i, i) = P^s_H(\tau(i), \tau(i)). \quad (4)$$

Clearly, isomorphic graphs have the same set of RPF, i.e., $\mathrm{RPF}^S_G = \mathrm{RPF}^S_H$, $\forall S = 1, 2, ..., \infty$. Such a property can be used to check graph isomorphism: if $\exists S$ s.t. $\mathrm{RPF}^S_G \neq \mathrm{RPF}^S_H$, then $G$ and $H$ are not isomorphic. 
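To make this concrete, the RPF can be computed directly from powers of the transition matrix, or estimated by simulated walks when the graph is large; the sketch below is our own illustration (the paper's implementation is in Matlab/C++, and all function names here are ours), including the non-isomorphism check just described:

```python
import numpy as np

def return_probability_features(A, S):
    """RPF (eq. (3)): row i is [P^1(i,i), ..., P^S(i,i)],
    where P = D^{-1} A is the random-walk transition matrix."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)   # P(i, j) = w_ij / deg(i)
    n = A.shape[0]
    rpf = np.empty((n, S))
    Ps = np.eye(n)
    for s in range(S):
        Ps = Ps @ P                        # Ps = P^(s+1)
        rpf[:, s] = np.diag(Ps)
    return rpf

def certifies_non_isomorphic(A1, A2, S=10):
    """True if the RPF multisets differ, which proves non-isomorphism.
    (Equal RPF sets do NOT prove isomorphism.)"""
    f1 = return_probability_features(A1, S)
    f2 = return_probability_features(A2, S)
    if f1.shape != f2.shape:
        return True
    return not np.allclose(sorted(map(tuple, f1)), sorted(map(tuple, f2)))

def mc_return_probability_features(A, S, M=2000, seed=0):
    """Monte Carlo estimate of the RPF: simulate M walks of length S from
    every node and count returns (cf. the simulation method of Section 3.2.1)."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)
    n = A.shape[0]
    counts = np.zeros((n, S))
    for i in range(n):
        for _ in range(M):
            v = i
            for s in range(S):
                v = rng.choice(n, p=P[v])
                if v == i:
                    counts[i, s] += 1
    return counts / M
```

On the triangle graph, for instance, every node has RPF $(0, 1/2, 1/4, ...)$, and the Monte Carlo estimate concentrates around the same values as $M$ grows.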
Moreover, Proposition 1 allows us to directly compare the structural roles of any two nodes in different graphs, without considering matching problems.
Multi-resolution. The RPF characterizes the "structural role" of nodes at multiple resolutions. Roughly speaking, $P^s_G(i, i)$ reflects the interaction between node $v_i$ and the subgraph surrounding $v_i$; with an increase in $s$, the subgraph becomes larger. We use a toy example to illustrate this idea. Fig. 1(a) presents an unweighted graph $G$, in which $C_1$, $C_2$, and $C_3$ are three center nodes that play different structural roles. In Fig. 1(b), we plot their $s$-step return probabilities, $s = 1, 2, ..., 200$. $C_1$, $C_2$, and $C_3$ have the same degree, as do their neighbors, so their first two return probabilities are the same. Since $C_1$ and $C_2$ share similar neighbourhoods at larger scales, their return probability values stay close until the eighth step. Because $C_3$ plays a very different structural role from $C_1$ and $C_2$, its return probability values deviate from those of $C_1$ and $C_2$ in early steps.
In addition, as shown in Fig. 1(b), when the random walk step $s$ approaches infinity, the return probability $P^s_G(i, i)$ does not change much and converges to a certain value, known as the stationary probability in Markov chain theory [5]. Therefore, if $s$ is already sufficiently large, we gain very little new information from the RPF by increasing $s$.
Informativeness. The RPF provides very rich information on the graph structure, in the sense that if two graphs have the same RPF sets, they share very similar spectral properties.
Theorem 1. Let $G$ and $H$ be two connected graphs of the same size $n$ and volume $\mathrm{Vol}$, and let $P_G$ and $P_H$ be the corresponding transition probability matrices. Let $\{(\lambda_k, \vec{\psi}_k)\}_{k=1}^n$ and $\{(\mu_k, \vec{\varphi}_k)\}_{k=1}^n$ be eigenpairs of $P_G$ and $P_H$, respectively. Let $\tau : \{1, 2, ..., n\} \to \{1, 2, ..., n\}$ be a permutation map. If $P^s_G(i, i) = P^s_H(\tau(i), \tau(i))$, $\forall v_i \in V_G$, $\forall s = 1, 2, ..., n$, i.e., $\mathrm{RPF}^n_G = \mathrm{RPF}^n_H$, then,

1. $\mathrm{RPF}^S_G = \mathrm{RPF}^S_H$, $\forall S = n+1, n+2, ..., \infty$;
2. $\{\lambda_1, \lambda_2, ..., \lambda_n\} = \{\mu_1, \mu_2, ..., \mu_n\}$;
3. If the eigenvalues sorted by their magnitudes satisfy $|\lambda_1| > |\lambda_2| > ... > |\lambda_m| > 0$, $|\lambda_{m+1}| = ... = |\lambda_n| = 0$, then $|\vec{\psi}_k(i)| = |\vec{\varphi}_k(\tau(i))|$, $\forall v_i \in V_G$, $\forall k = 1, 2, ..., m$.

Figure 1: (a) Toy graph $G$; (b) the $s$-step return probabilities of the nodes $C_1$, $C_2$, and $C_3$ in the toy graph, $s = 1, 2, ..., 200$. The nested figure is a close-up view of the rectangular region.

The first conclusion states that the graph structure information contained in $\mathrm{RPF}^n_G$ and $\mathrm{RPF}^S_G$, $S \ge n$, is the same, coinciding with our previous discussion of the RPF with large random walk steps. The second and third conclusions bridge the RPF with spectral representations of graphs [4], which contain almost all graph structure information.
Relation to eigenvector embeddings (EE). One popular way of embedding graph nodes in a Euclidean space uses the eigenvectors of Laplacian or adjacency matrices as the coordinates. In [21], a class of graph kernels is developed based on eigenvector embeddings. From Theorem 1, we see that both RPF and EE encode the spectral information of graphs. However, our RPF has several advantages over EE. (i) The eigenvector embeddings reflect the closeness among nodes in the same graph, which makes it difficult to compare nodes across graphs. (ii) The EE representations, which are computed up to a change in sign (or, more generally, an orthonormal transformation in the eigenspace), may not be invariant under graph isomorphisms. A counterexample is shown in Fig. 2. 
$G$ and $G'$ are two isomorphic graphs; we visualize their first three-dimensional embeddings with RPF and EE.³ It can be seen that the RPFs are invariant while the eigenvectors are not. (iii) The eigenvector embeddings are unstable: perturbation theory says that two eigenvectors may switch if their eigenvalues are close.

³Note that since the signs of these eigenvectors are not fixed, we use the absolute values, as in [21].

3.2 The computation of RPF

Given a graph $G$, the brute-force computation of $\mathrm{RPF}^S_G$ requires $(S-1)$ multiplications of $n \times n$ matrices. Therefore, the time complexity is $(S-1)n^3$, which is quite high when $S$ is large.
Since only the diagonal terms of the transition matrices are needed, we have more efficient techniques. Write

$$P_G = D_G^{-1} A_G = D_G^{-\frac{1}{2}} \big(D_G^{-\frac{1}{2}} A_G D_G^{-\frac{1}{2}}\big) D_G^{\frac{1}{2}} = D_G^{-\frac{1}{2}} B_G D_G^{\frac{1}{2}}, \quad (5)$$

where $B_G = D_G^{-\frac{1}{2}} A_G D_G^{-\frac{1}{2}}$ is a symmetric matrix. Then $P^s_G = D_G^{-\frac{1}{2}} B^s_G D_G^{\frac{1}{2}}$. Let $\{(\lambda_k, \vec{u}_k)\}_{k=1}^n$ be the eigenpairs of $B_G$, i.e., $B_G = \sum_{k=1}^n \lambda_k \vec{u}_k \vec{u}_k^T$. Then the return probabilities are

$$P^s_G(i, i) = B^s_G(i, i) = \sum_{k=1}^n \lambda_k^s \big[\vec{u}_k(i)\big]^2, \quad \forall v_i \in V_G, \ \forall s = 1, 2, ..., S. \quad (6)$$

Let $U = [\vec{u}_1, \vec{u}_2, ..., \vec{u}_n]$, let $V = U \odot U$, where $\odot$ denotes the Hadamard product, and let $\vec{\Lambda}_s = [\lambda_1^s, \lambda_2^s, ..., \lambda_n^s]^T$. Then we can obtain all nodes' $s$-step return probabilities in the vector $V \vec{\Lambda}_s$. The eigendecomposition of $B_G$ requires time $O(n^3)$. Computing $V$ or $V \vec{\Lambda}_s$, $\forall s = 1, 2, ..., S$, takes time $O(n^2)$. So the total time complexity of the above computational method is $O(n^3 + (S+1)n^2)$.

3.2.1 Monte Carlo simulation method

If the graph node number $n$ is large, i.e., $n > 10^5$, the eigendecomposition of an $n \times n$ matrix is relatively time-consuming. To make the RPF scalable to large graphs, we use the Monte Carlo method to simulate random walks. Given a graph $G$, for each node $v_i \in V_G$, we simulate a random walk of length $S$ based on the transition probability matrix $P_G$. We repeat the above procedure $M$ times, obtaining $M$ sequences of random walks. For each step $s = 1, 2, ..., S$, we use the relative frequency of returning to the starting point as the estimate of the corresponding $s$-step return probability. The random walk simulation is parallelizable and can be implemented efficiently, both of which contribute to the scalability of the RPF.

4 Hilbert space embeddings of graphs

In this section, we introduce the Hilbert space embeddings of graphs, based on the RPF. With such Hilbert space embeddings, we can naturally obtain the corresponding graph kernels.
As discussed in Section 3, the structural role of each node $v_i$ can be characterized by an $S$-dimensional return probability vector $\vec{p}_i$ (see (3)), and thus a nonattributed graph can be represented by the set $\mathrm{RPF}^S_G = \{\vec{p}_i\}_{i=1}^n$. Since the isomorphism-invariance property allows direct comparison of nodes' structural roles across different graphs, we can view the RPF as a special type of attribute, namely, "the structural role attribute" (whose domain is denoted as $\mathcal{A}_0$), associated with the nodes. Clearly, $\mathcal{A}_0 = \mathbb{R}^S$.
The nodes of attributed graphs usually have other types of attributes, which are obtained by physical measurements. Let $\mathcal{A}_1, \mathcal{A}_2, ..., \mathcal{A}_L$ be their attribute domains. When combined with the RPF, an attributed graph can be represented by the set $\{(\vec{p}_i, a^1_i, ..., a^L_i)\}_{i=1}^n \subseteq \mathcal{A}_0 \times \mathcal{A}_1 \times \cdots \times \mathcal{A}_L$ (denoted as $\times_{l=0}^L \mathcal{A}_l$). 
Such a representation allows us to consider both attributed and nonattributed graphs in a unified framework, since if $L = 0$, the above set just degenerates to the nonattributed case. The set representation forms an empirical distribution $\mu = \frac{1}{n} \sum_{i=1}^n \delta_{(\vec{p}_i, a^1_i, ..., a^L_i)}$ on $\mathcal{A} = \times_{l=0}^L \mathcal{A}_l$, which can be embedded into a reproducing kernel Hilbert space (RKHS) by kernel mean embedding [11]. Let $k_l$, $l = 0, 1, ..., L$, be a kernel on $\mathcal{A}_l$. Let $\mathcal{H}_l$ and $\phi_l$ be the corresponding RKHS and implicit feature map, respectively. Then we can define a kernel on $\mathcal{A}$ through the tensor product of kernels [28], i.e., $k = \otimes_{l=0}^L k_l$, with

$$k\big[(\vec{p}, a^1, a^2, ..., a^L), (\vec{q}, b^1, b^2, ..., b^L)\big] = k_0(\vec{p}, \vec{q}) \prod_{l=1}^L k_l(a^l, b^l).$$

Its associated RKHS, $\mathcal{H}$, is the tensor product space generated by the $\mathcal{H}_l$, i.e., $\mathcal{H} = \otimes_{l=0}^L \mathcal{H}_l$. Let $\phi : \mathcal{A} \to \mathcal{H}$ be the implicit feature map.

Figure 2: (a) Toy graph $G$ and its adjacency matrix; (b) toy graph $G'$ and its adjacency matrix; (c) 3-D eigenvector and RPF embeddings of the nodes in $G$ and $G'$, respectively. We can see that our RPF correctly reflects the structural roles: the nodes $V_3, V_4, V_5$ in graph $G$ and the nodes $V'_1, V'_2, V'_3$ in graph $G'$ have the same structural role, and the nodes $V_1, V_2$ in graph $G$ and the nodes $V'_4, V'_5$ in graph $G'$ have the same structural role.

Then, given a graph $G$, we can embed it into $\mathcal{H}$ by the following procedure: $G \to \mu_G \to m_G$, with

$$m_G = \int_{\mathcal{A}} \phi \, d\mu_G = \frac{1}{n} \sum_{i=1}^n \phi(\vec{p}_i, a^1_i, ..., a^L_i). \quad (7)$$

4.1 Graph kernels (I)

An important benefit of the Hilbert space embedding of graphs is that it is straightforward to generalize positive definite kernels defined on Euclidean spaces to the set of graphs.
Given two graphs $G$ and $H$, let $\{\triangle^G_i\}_{i=1}^{n_G}$ and $\{\triangle^H_j\}_{j=1}^{n_H}$ be the respective set representations, where $\triangle^G_i = (\vec{p}_i, a^1_i, a^2_i, ..., a^L_i)$ and likewise for $\triangle^H_j$. Let $K_{GG}$, $K_{HH}$, and $K_{GH}$ be the kernel matrices induced by the embedding kernel $k$. That is, they are defined such that $(K_{GG})_{ij} = k(\triangle^G_i, \triangle^G_j)$, $(K_{HH})_{ij} = k(\triangle^H_i, \triangle^H_j)$, and $(K_{GH})_{ij} = k(\triangle^G_i, \triangle^H_j)$.
Proposition 2. Let $\mathcal{G}$ be the set of graphs with attribute domains $\mathcal{A}_1, \mathcal{A}_2, ..., \mathcal{A}_L$. Let $G$ and $H$ be two graphs in $\mathcal{G}$, and let $m_G$ and $m_H$ be the corresponding graph embeddings. Then the following functions are positive definite graph kernels defined on $\mathcal{G} \times \mathcal{G}$:

$$K_1(G, H) = (c + \langle m_G, m_H \rangle_{\mathcal{H}})^d = \Big(c + \frac{1}{n_G n_H} \vec{1}_{n_G}^T K_{GH} \vec{1}_{n_H}\Big)^d, \quad c \ge 0, \ d \in \mathbb{N}, \quad (8a)$$

$$K_2(G, H) = \exp(-\gamma \|m_G - m_H\|_{\mathcal{H}}^p) = \exp\big[-\gamma \, \mathrm{MMD}^p(\mu_G, \mu_H)\big], \quad \gamma > 0, \ 0 < p \le 2, \quad (8b)$$

where $\mathrm{MMD}(\mu_G, \mu_H) = \big(\frac{1}{n_G^2} \vec{1}_{n_G}^T K_{GG} \vec{1}_{n_G} + \frac{1}{n_H^2} \vec{1}_{n_H}^T K_{HH} \vec{1}_{n_H} - \frac{2}{n_G n_H} \vec{1}_{n_G}^T K_{GH} \vec{1}_{n_H}\big)^{\frac{1}{2}}$ is the maximum mean discrepancy (MMD) [11].
Kernel selection. In real applications, such as bioinformatics, graphs may have discrete labels and (multi-dimensional) real-valued attributes. 
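As a concrete illustration of the MMD-based kernel (8b) in the simplest non-attributed case ($L = 0$), the computation reduces to averaging three kernel matrices over the RPF vectors. A minimal sketch (our own function names, using the Laplacian RBF kernel that the experiments later adopt for $\mathcal{A}_0$):

```python
import numpy as np

def laplacian_rbf(X, Y, gamma):
    # k0(p, q) = exp(-gamma * ||p - q||_2), a kernel on the RPF domain A0.
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return np.exp(-gamma * d)

def retgk_mmd_kernel(X, Y, gamma_node=1.0, gamma=1.0, p=2):
    """K2(G, H) = exp(-gamma * MMD(mu_G, mu_H)^p) of eq. (8b), where X and Y
    are the (n_G x S) and (n_H x S) RPF matrices of the two graphs."""
    Kgg = laplacian_rbf(X, X, gamma_node)
    Khh = laplacian_rbf(Y, Y, gamma_node)
    Kgh = laplacian_rbf(X, Y, gamma_node)
    mmd2 = Kgg.mean() + Khh.mean() - 2.0 * Kgh.mean()
    mmd = np.sqrt(max(mmd2, 0.0))   # guard against tiny negative round-off
    return np.exp(-gamma * mmd ** p)
```

Identical node-feature sets give $K_2 = 1$, and the value decays toward 0 as the two RPF distributions move apart.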
Hence, three attribute domains are involved in the computation of our graph kernels: the structural role attribute domain $\mathcal{A}_0$, the discrete attribute domain $\mathcal{A}_d$, and the continuous attribute domain $\mathcal{A}_c$. For $\mathcal{A}_d$, we can use the Delta kernel $k_d(a, b) = \mathbb{I}_{\{a = b\}}$. For $\mathcal{A}_0$ and $\mathcal{A}_c$, which are just Euclidean spaces, we can use the Gaussian RBF kernel, the Laplacian RBF kernel, or the polynomial kernel.

5 Approximate Hilbert space embeddings of graphs

Based on the above discussion, we see that obtaining a graph kernel value for each pair of graphs requires calculating the inner product or the $L_2$ distance between two Hilbert embeddings (see (8a) and (8b)), both of which scale quadratically with the node numbers. Such time complexity precludes application to large graph datasets. To tackle this issue, we employ the recently emerged approximate explicit feature maps [23].
For a kernel $k_l$ on the attribute domain $\mathcal{A}_l$, $l = 0, 1, ..., L$, we find an explicit map $\hat{\phi}_l : \mathcal{A}_l \to \mathbb{R}^{D_l}$ such that

$$\forall a, b \in \mathcal{A}_l, \quad \langle \hat{\phi}_l(a), \hat{\phi}_l(b) \rangle = \hat{k}_l(a, b), \quad \text{and} \quad \hat{k}_l(a, b) \to k_l(a, b) \ \text{as} \ D_l \to \infty. \quad (9)$$

The explicit feature maps will be directly used to compute the approximate graph embeddings, by virtue of tensor algebra (see Section 2.2). The following theorem says that the approximate explicit graph embeddings can be written as linear combinations of rank-one tensors.
Theorem 2. Let $G$ and $H$ be any two graphs in $\mathcal{G}$. Let $\{(\vec{p}_i, a^1_i, a^2_i, ..., a^L_i)\}_{i=1}^{n_G}$ and $\{(\vec{q}_j, b^1_j, b^2_j, ..., b^L_j)\}_{j=1}^{n_H}$ be the respective set representations of G and H. 
Then their approximate explicit graph embeddings, $\hat{m}_G$ and $\hat{m}_H$, are tensors in $\mathbb{R}^{D_0 \times D_1 \times ... \times D_L}$, and can be written as

$$\hat{m}_G = \frac{1}{n_G} \sum_{i=1}^{n_G} \hat{\phi}_0(\vec{p}_i) \circ \hat{\phi}_1(a^1_i) \circ \cdots \circ \hat{\phi}_L(a^L_i), \quad \hat{m}_H = \frac{1}{n_H} \sum_{j=1}^{n_H} \hat{\phi}_0(\vec{q}_j) \circ \hat{\phi}_1(b^1_j) \circ \cdots \circ \hat{\phi}_L(b^L_j). \quad (10)$$

That is, as $D_0, D_1, ..., D_L \to \infty$, we have $\langle \hat{m}_G, \hat{m}_H \rangle_T \to \langle m_G, m_H \rangle_{\mathcal{H}}$.

5.1 Graph kernels (II)

With the approximate tensor embeddings (10), we obtain new graph kernels.
Proposition 3. The following functions are positive definite graph kernels defined on $\mathcal{G} \times \mathcal{G}$:

$$\hat{K}_1(G, H) = (c + \langle \hat{m}_G, \hat{m}_H \rangle_T)^d = \big[c + \mathrm{vec}(\hat{m}_G)^T \mathrm{vec}(\hat{m}_H)\big]^d, \quad c \ge 0, \ d \in \mathbb{N}, \quad (11a)$$

$$\hat{K}_2(G, H) = \exp(-\gamma \|\hat{m}_G - \hat{m}_H\|_T^p) = \exp(-\gamma \|\mathrm{vec}(\hat{m}_G) - \mathrm{vec}(\hat{m}_H)\|_2^p), \quad \gamma > 0, \ 0 < p \le 2. \quad (11b)$$

Moreover, as $D_0, D_1, ..., D_L \to \infty$, we have $\hat{K}_1(G, H) \to K_1(G, H)$ and $\hat{K}_2(G, H) \to K_2(G, H)$.
The vectorization of $\hat{m}_G$ (or $\hat{m}_H$) can be easily implemented with the Kronecker product, i.e., $\mathrm{vec}(\hat{m}_G) = \frac{1}{n_G} \sum_{i=1}^{n_G} \hat{\phi}_0(\vec{p}_i) \otimes \hat{\phi}_1(a^1_i) \otimes \cdots \otimes \hat{\phi}_L(a^L_i)$. To obtain the above graph kernels, we need only compute the Euclidean inner product or distance between vectors. More notably, the size of the tensor representation does not depend on node numbers, making it scalable to large graphs.
Approximate explicit feature map selection. For the Delta kernel on the discrete attribute domain, we directly use the one-hot vector. 
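For instance, with one structural feature map and one discrete label ($L = 1$), the vectorized embedding above is just an average of per-node Kronecker products. A small sketch (helper names are ours; the identity map stands in for a generic $\hat{\phi}_0$):

```python
import numpy as np

def one_hot(label, alphabet):
    # Explicit feature map of the Delta kernel on a discrete attribute domain.
    v = np.zeros(len(alphabet))
    v[alphabet.index(label)] = 1.0
    return v

def vec_embedding(feats, labels, alphabet,
                  phi0=lambda p: np.asarray(p, dtype=float)):
    """vec(m_G) = (1/n) * sum_i phi0(p_i) (Kronecker) one_hot(a_i),
    the L = 1 case of the tensor embedding (10)."""
    vecs = [np.kron(phi0(p), one_hot(a, alphabet))
            for p, a in zip(feats, labels)]
    return sum(vecs) / len(vecs)

# <m_G, m_H> is then a plain Euclidean inner product, as in (11a).
```

The inner product of two such vectors equals the average pairwise product of the structural and label kernels over all node pairs, which is what Theorem 2 guarantees in the limit.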
For shift-invariant kernels on Euclidean spaces such as $\mathcal{A}_0$ and $\mathcal{A}_c$, i.e., kernels of the form $k(\vec{x}, \vec{y}) = k(\vec{x} - \vec{y})$, we make use of the random Fourier feature map [23], $\hat{\phi} : \mathbb{R}^d \to \mathbb{R}^D$, satisfying $\langle \hat{\phi}(\vec{x}), \hat{\phi}(\vec{y}) \rangle \approx k(\vec{x}, \vec{y})$. To do this, we first draw $D$ i.i.d. samples $\omega_1, \omega_2, ..., \omega_D$ from a proper distribution $p(\omega)$. (Note that in this paper, we use $p(\omega) = \frac{1}{(\sqrt{2\pi}\sigma)^d} \exp(-\frac{\|\omega\|^2}{2\sigma^2})$.) Next, we draw $D$ i.i.d. samples $b_1, b_2, ..., b_D$ from the uniform distribution on $[0, 2\pi]$. Finally, we calculate

$$\hat{\phi}(\vec{x}) = \sqrt{\tfrac{2}{D}} \big[\cos(\omega_1^T \vec{x} + b_1), ..., \cos(\omega_D^T \vec{x} + b_D)\big]^T \in \mathbb{R}^D.$$

6 Experiments

In this section, we conduct extensive experiments to demonstrate the effectiveness of our graph kernels. We run all the experiments on a laptop with an Intel i7-7820HQ 2.90GHz CPU and 64GB RAM. We implement our algorithms in Matlab, except for the Monte Carlo based computation of the RPF (see Section 3.2.1), which is implemented in C++.

6.1 Datasets

We conduct graph classification on four types of benchmark datasets [16]. (i) Non-attributed (unlabeled) graph datasets: COLLAB, IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, REDDIT-MULTI(5K), and REDDIT-MULTI(12K) [31] are generated from social networks. (ii) Graphs with discrete attributes (labels): DD [8] are proteins. MUTAG [7], NCI1 [25], PTC-FM, PTC-FR, PTC-MM, and PTC-MR [13] are chemical compounds. (iii) Graphs with continuous attributes: FRANK is a chemical molecule dataset [15]. SYNTHETIC and Synthie are synthetic datasets based on random graphs, which were first introduced in [9] and [19], respectively. 
(iv) Graphs with both discrete and continuous attributes: ENZYMES and PROTEINS [2] are graph representations of proteins. BZR, COX2, and DHFR [27] are chemical compounds. Detailed descriptions, including statistical properties, of these 21 datasets are provided in the supplementary material.

6.2 Experimental setup

We evaluate both the graph kernels (I) and (II) introduced in Section 4.1 and Section 5.1, which are denoted by RetGKI and RetGKII, respectively. The Monte Carlo computation of the return probability features, denoted by RetGKII(MC), is also considered. In our experiments, we run 200 Monte Carlo trials, i.e., M = 200, for obtaining the RPF. For handling isolated nodes, whose degrees are zero, we artificially add a self-loop to each node in the graphs.
Parameters. In all experiments, we set the random walk step S = 50. For RetGKI, we use the Laplacian RBF kernel for both the structural role domain $\mathcal{A}_0$ and the continuous attribute domain $\mathcal{A}_c$, i.e., $k_0(\vec{p}, \vec{q}) = \exp(-\gamma_0 \|\vec{p} - \vec{q}\|_2)$ and $k_c(\vec{a}, \vec{b}) = \exp(-\gamma_c \|\vec{a} - \vec{b}\|_2)$. We set $\gamma_0$ to be the inverse of the median of all pairwise distances, and set $\gamma_c$ to be the inverse of the square root of the attributes' dimension, except for the FRANK dataset, whose $\gamma_c$ is set to be the value 0.0073 recommended in [22] and [19]. For RetGKII, on the first three types of graphs, we set the dimensions of the random Fourier feature maps on $\mathcal{A}_0$ and $\mathcal{A}_c$ both to be 200, i.e., $D_0 = D_c = 200$, except for the FRANK dataset, whose $D_c$ is set to be 500 because its attributes lie in a much higher dimensional space. On the graphs with both discrete and continuous attributes, for the sake of computational efficiency, we set $D_0 = D_c = 100$. For both RetGKI and RetGKII, we make use of the graph kernels with exponential forms, $\exp(-\gamma \|\cdot\|^p)$ (see (8b) and (11b)). We select $p$ from $\{1, 2\}$, and set $\gamma = \frac{1}{\mathrm{dist}^p}$, where $\mathrm{dist}$ is the median of all the pairwise graph embedding distances.
We compare our graph kernels with many state-of-the-art graph classification algorithms: (i) the shortest path kernel (SP) [1], (ii) the Weisfeiler-Lehman subtree kernel (WL) [25], (iii) the graphlet count kernel (GK) [26], (iv) deep graph kernels (DGK) [31], (v) the PATCHY-SAN convolutional neural network (PSCN) [20], (vi) the deep graph convolutional neural network (DGCNN) [33], (vii) graph invariant kernels (GIK) [22], (viii) hashing Weisfeiler-Lehman graph kernels (HGK(WL)) [19], and (ix) subgraph matching kernels (CSM) [18].
For all graph kernels, we employ an SVM [3] as the final classifier. The tradeoff parameter C is selected from $\{10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 10^2, 10^3\}$. We perform 10-fold cross-validations, using 9 folds for training and 1 for testing, and repeat the experiments 10 times. We report average classification accuracies and standard errors.

6.3 Experimental results

The classification results⁴ on the four types of datasets are shown in Tables 1, 2, 3, and 4. The best results are highlighted in bold. We also report the total time of computing the graph kernels of all the datasets in each table. It can be seen that the graph kernels RetGKI and RetGKII both achieve superior or comparable performance on all the benchmark datasets. Especially on the datasets COLLAB, REDDIT-BINARY, REDDIT-MULTI(12K), Synthie, BZR, and COX2, our approaches significantly outperform the other state-of-the-art algorithms: the classification accuracies of our approaches on these datasets are at least six percentage points higher than those of the best baseline algorithms. Moreover, we see that RetGKII and RetGKII(MC) are faster than the baseline methods. Their running times remain perfectly practical. 
On the large social network datasets (see Table 1), RetGKII(MC) is almost one order of magnitude faster than the Weisfeiler-Lehman subtree kernel, which is well known for its computational efficiency.

6.4 Sensitivity analysis

Here, we conduct a parameter sensitivity analysis of RetGKII on the datasets REDDIT-BINARY, NCI1, SYNTHETIC, Synthie, ENZYMES, and PROTEINS. We test the stability of RetGKII by varying the values of the random walk step S, the dimension $D_0$ of the approximate explicit feature map on $\mathcal{A}_0$, and the dimension $D_c$ of the feature map on $\mathcal{A}_c$. We plot the average classification accuracy of ten repetitions of 10-fold cross-validation with respect to S, $D_0$, and $D_c$ in Fig. 3. It can be concluded that RetGKII performs consistently across a wide range of parameter values.

Table 1: Classification results (in %) for non-attributed (unlabeled) graph datasets

Dataset            | WL        | GK        | DGK       | PSCN      | RetGKI    | RetGKII   | RetGKII(MC)
COLLAB             | 74.8(0.2) | 72.8(0.3) | 73.1(0.3) | 72.6(2.2) | 81.0(0.3) | 80.6(0.3) | 73.6(0.3)
IMDB-BINARY        | 70.8(0.5) | 65.9(1.0) | 67.0(0.6) | 71.0(2.3) | 71.9(1.0) | 72.3(0.6) | 71.0(0.6)
IMDB-MULTI         | 49.8(0.5) | 43.9(0.4) | 44.6(0.5) | 45.2(2.8) | 47.7(0.3) | 48.7(0.6) | 46.7(0.6)
REDDIT-BINARY      | 68.2(0.2) | 77.3(0.2) | 78.0(0.4) | 86.3(1.6) | 92.6(0.3) | 91.6(0.2) | 90.8(0.2)
REDDIT-MULTI(5K)   | 51.2(0.3) | 41.0(0.2) | 41.3(0.2) | 49.1(0.7) | 56.1(0.5) | 55.3(0.3) | 54.2(0.3)
REDDIT-MULTI(12K)  | 32.6(0.3) | 31.8(0.1) | 32.2(0.1) | 41.3(0.4) | 48.7(0.2) | 47.1(0.3) | 45.9(0.2)
Total time         | 2h3m      | –         | –         | –         | 48h14m    | 17m14s    | 6m9s

⁴The accuracies of WL, SP, and GK are obtained from our own experiments. 
For the other competing algorithms, we directly quote the values reported in their papers.

Table 2: Classification results (in %) for graph datasets with discrete attributes

Datasets     SP         WL         GK         CSM        DGK        PSCN       DGCNN      RetGKI     RetGKII
ENZYMES      38.6(1.5)  53.4(0.9)  –          60.4(1.6)  53.4(0.9)  –          –          60.4(0.8)  59.1(1.1)
PROTEINS     73.3(0.9)  71.2(0.8)  71.7(0.6)  –          75.7(0.5)  75.0(2.5)  75.5(0.9)  75.8(0.6)  75.2(0.3)
MUTAG        85.2(2.3)  84.4(1.5)  81.6(2.1)  85.4(1.2)  87.4(2.7)  89.0(4.4)  85.8(1.7)  90.3(1.1)  90.1(1.0)
DD           >24h       78.6(0.4)  78.5(0.3)  –          –          76.2(2.6)  79.4(0.9)  81.6(0.3)  81.0(0.5)
NCI1         74.8(0.4)  85.4(0.3)  62.3(0.3)  –          80.3(0.5)  76.3(1.7)  74.4(0.5)  84.5(0.2)  83.5(0.2)
PTC-FM       60.5(1.7)  55.2(2.3)  –          63.8(1.0)  –          –          –          62.3(1.0)  63.9(1.3)
PTC-FR       61.6(1.0)  63.9(1.4)  –          65.5(1.4)  –          –          –          66.7(1.4)  67.8(1.1)
PTC-MM       62.9(1.4)  60.6(1.1)  –          63.3(1.7)  –          –          –          65.6(1.1)  67.9(1.4)
PTC-MR       57.8(2.1)  55.4(1.5)  57.3(1.1)  58.1(1.6)  60.1(2.6)  62.3(5.7)  58.6(2.5)  62.5(1.6)  62.1(1.5)
Total time   >24h       2m27s      –          –          –          –          –          38m4s      49.9s

Figure 3: Parameter sensitivity study for RetGKII on six benchmark datasets

Table 3: Classification results (in %) for graph datasets with continuous attributes

Datasets     HGK(WL)    RetGKI     RetGKII
ENZYMES      63.9(1.1)  70.0(0.9)  70.7(0.9)
PROTEINS     74.9(0.6)  76.2(0.5)  75.9(0.4)
FRANK        73.2(0.3)  76.4(0.3)  76.7(0.4)
SYNTHETIC    97.6(0.4)  97.9(0.3)  98.9(0.4)
Synthie      80.3(1.4)  97.1(0.3)  96.2(0.3)
Total time   –          45m30s     40.8s

Table 4: Classification results (in %) for graph datasets with both discrete and continuous attributes

Datasets     GIK        CSM        RetGKI     RetGKII
ENZYMES      71.7(0.8)  69.8(0.7)  72.2(0.8)  70.6(0.7)
PROTEINS     76.1(0.3)  –          78.0(0.3)  77.3(0.5)
BZR          –          79.4(1.2)  86.4(1.2)  87.1(0.7)
COX2         –          74.4(1.7)  80.1(0.9)  81.4(0.6)
DHFR         –          79.9(1.1)  81.5(0.9)  82.5(0.8)
Total time   –          –          4m17s      2m51s

7 Conclusion

In this paper, we introduced the return probability feature (RPF) for characterizing and comparing the structural roles of nodes across graphs. Based on the RPF, we embedded graphs in an RKHS and derived the corresponding graph kernel RetGKI. Then, making use of approximate explicit feature maps, we represented each graph with a multi-dimensional tensor and obtained the computationally efficient graph kernel RetGKII. We applied RetGKI and RetGKII to graph classification and achieved promising results on many benchmark datasets. Given the prevalence of structured data, we believe that our work can be useful in many applications.

8 Acknowledgement

This work was supported in part by the AFOSR grant FA9550-16-1-0386.

References

[1] Karsten M Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Data Mining, Fifth IEEE International Conference on, pages 8–pp. IEEE, 2005.

[2] Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1):i47–i56, 2005.

[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

[4] Fan RK Chung. Spectral graph theory. Number 92.
American Mathematical Soc., 1997.

[5] Erhan Cinlar. Introduction to stochastic processes. Courier Corporation, 2013.

[6] Giovanni Da San Martino, Nicolò Navarin, and Alessandro Sperduti. Tree-based kernel for graphs with continuous attributes. IEEE Transactions on Neural Networks and Learning Systems, 2017.

[7] Asim Kumar Debnath, Rosa L Lopez de Compadre, Gargi Debnath, Alan J Shusterman, and Corwin Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2):786–797, 1991.

[8] Paul D Dobson and Andrew J Doig. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003.

[9] Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, and Karsten Borgwardt. Scalable kernels for graphs with continuous attributes. In Advances in Neural Information Processing Systems, pages 216–224, 2013.

[10] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pages 129–143. Springer, 2003.

[11] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[12] David Haussler. Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz, 1999.

[13] Christoph Helma, Ross D. King, Stefan Kramer, and Ashwin Srinivasan. The predictive toxicology challenge 2000–2001. Bioinformatics, 17(1):107–108, 2001.

[14] Jaz Kandola, Thore Graepel, and John Shawe-Taylor. Reducing kernel matrix diagonal dominance using semi-definite programming.
In Learning Theory and Kernel Machines, pages 288–302. Springer, 2003.

[15] Jeroen Kazius, Ross McGuire, and Roberta Bursi. Derivation and validation of toxicophores for mutagenicity prediction. Journal of Medicinal Chemistry, 48(1):312–320, 2005.

[16] Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. http://graphkernels.cs.tu-dortmund.de.

[17] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[18] Nils Kriege and Petra Mutzel. Subgraph matching kernels for attributed graphs. In ICML, 2012.

[19] Christopher Morris, Nils M Kriege, Kristian Kersting, and Petra Mutzel. Faster kernels for graphs with continuous attributes via hashing. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 1095–1100. IEEE, 2016.

[20] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.

[21] Giannis Nikolentzos, Polykarpos Meladianos, and Michalis Vazirgiannis. Matching node embeddings for graph similarity. In AAAI, pages 2429–2435, 2017.

[22] Francesco Orsini, Paolo Frasconi, and Luc De Raedt. Graph invariant kernels. In Proceedings of the Twenty-fourth International Joint Conference on Artificial Intelligence, pages 3756–3762, 2015.

[23] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[24] Tiago A Schieber, Laura Carpi, Albert Díaz-Guilera, Panos M Pardalos, Cristina Masoller, and Martín G Ravetti. Quantification of network structural dissimilarities.
Nature Communications, 8:13928, 2017.

[25] Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep):2539–2561, 2011.

[26] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.

[27] Jeffrey J Sutherland, Lee A O'Brien, and Donald F Weaver. Spline-fitting with a genetic algorithm: A method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences, 43(6):1906–1915, 2003.

[28] Zoltán Szabó and Bharath K Sriperumbudur. Characteristic and universal tensor product kernels. arXiv preprint arXiv:1708.08157, 2017.

[29] Saurabh Verma and Zhi-Li Zhang. Hunt for the unique, stable, sparse and fast feature learning on graphs. In Advances in Neural Information Processing Systems, pages 87–97, 2017.

[30] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. Graph kernels. Journal of Machine Learning Research, 11(Apr):1201–1242, 2010.

[31] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.

[32] Pinar Yanardag and SVN Vishwanathan. A structural smoothing framework for robust graph comparison. In Advances in Neural Information Processing Systems, pages 2134–2142, 2015.

[33] Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification.
In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.