{"title": "KONG: Kernels for ordered-neighborhood graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 4051, "page_last": 4060, "abstract": "We present novel graph kernels for graphs with node and edge labels that have ordered neighborhoods, i.e. when neighbor nodes follow an order. Graphs with ordered neighborhoods are a natural data representation for evolving graphs where edges are created over time, which induces an order. Combining convolutional subgraph kernels and string kernels, we design new scalable algorithms for generation of explicit graph feature maps using sketching techniques. We obtain precise bounds for the approximation accuracy and computational complexity of the proposed approaches and demonstrate their applicability on real datasets. In particular, our experiments demonstrate that neighborhood ordering results in more informative features. For the special case of general graphs, i.e. graphs without ordered neighborhoods, the new graph kernels yield efficient and simple algorithms for the comparison of label distributions between graphs.", "full_text": "KONG: Kernels for ordered-neighborhood graphs\n\nKonstantin Kutzkov2\n\nKevin Scaman1\n\nMilan Vojnovic2\n\nMoez Draief1\n\n1 Huawei Noah\u2019s Ark Lab\n\n2 London School of Economics, London\n\nmoez.draief@huawei.com, kutzkov@gmail.com (Corresponding author),\n\nkevin.scaman@huawei.com, m.vojnovic@lse.ac.uk\n\nAbstract\n\nWe present novel graph kernels for graphs with node and edge labels that have\nordered neighborhoods, i.e. when neighbor nodes follow an order. Graphs with\nordered neighborhoods are a natural data representation for evolving graphs where\nedges are created over time, which induces an order. Combining convolutional\nsubgraph kernels and string kernels, we design new scalable algorithms for gen-\neration of explicit graph feature maps using sketching techniques. 
We obtain precise bounds for the approximation accuracy and computational complexity of the proposed approaches and demonstrate their applicability on real datasets. In particular, our experiments demonstrate that neighborhood ordering results in more informative features. For the special case of general graphs, i.e., graphs without ordered neighborhoods, the new graph kernels yield efficient and simple algorithms for the comparison of label distributions between graphs.\n\n1 Introduction\n\nGraphs are ubiquitous representations for structured data and have found numerous applications in machine learning and related fields, ranging from community detection in online social networks [Fortunato, 2010] to protein structure prediction [Rual et al., 2005]. Unsurprisingly, learning from graphs has attracted much attention from the research community. Graph kernels have become a standard tool for graph classification [Kriege et al., 2017]. Given a large collection of graphs, possibly with node and edge attributes, we are interested in learning a kernel function that best captures the similarity between any two graphs. The graph kernel function can be used to classify graphs using standard kernel methods such as support vector machines.\nGraph similarity is a broadly defined concept and therefore many different graph kernels with different properties have been proposed. Previous works have considered graph kernels for different graph classes, distinguishing between simple unweighted graphs without node or edge attributes, graphs with discrete node and edge labels, and graphs with more complex attributes such as real-valued vectors and partial labels. For evolving graphs, the ordering of the node neighborhoods can be indicative of the graph class. 
Concrete examples include graphs that describe user web browsing patterns, evolving networks such as social graphs, product purchases and reviews, ratings in recommendation systems, co-authorship networks, and software API calls used for malware detection.\nThe order in which edges are created can be informative about the structure of the original data. To the best of our knowledge, existing graph kernels do not consider this aspect. Addressing this gap, we present a novel framework for graph kernels where the edges adjacent to a node follow a specific order. The proposed algorithmic framework KONG, referring to Kernels for Ordered-Neighborhood Graphs, accommodates highly efficient algorithms that scale to both massive graphs and large collections of graphs. The key ideas are: (a) representation of each node neighborhood by a string using a tree traversal method, and (b) efficient computation of explicit graph feature maps based on generating k-gram frequency vectors of each node\u2019s string without explicitly storing the strings. The latter enables approximating the explicit feature maps of various kernel functions using sketching techniques.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nExplicit feature maps correspond to k-gram frequency vectors of node strings, and sketching amounts to incrementally computing sketches of these frequency vectors. The proposed algorithms allow for flexibility in the choice of the string kernel and the tree traversal method. In Figure 1 we present a directed labeled subgraph rooted at node v. 
A breadth-first search traversal would result in the string ABCDGEFHG but other traversal approaches might yield more informative strings.\n\nFigure 1: An illustrative example of an ordered-neighborhood graph: the neighbor order is the alphabetical order of the labels.\n\nRelated work Many graph kernels with different properties have been proposed in the literature. Most of them work with implicit feature maps and compare pairs of graphs; we refer to [Kriege et al., 2017] for a study on implicit and explicit graph feature maps. Most related to our work is the Weisfeiler-Lehman kernel [Shervashidze et al., 2011] that iteratively traverses the subtree rooted at each node and collects the corresponding labels into a string. Each string is sorted and the strings are compressed into unique integers which become the new node labels. After h iterations we have a label at each node. The convolutional kernel that compares all pairs of node labels using the Dirac kernel (indicator of an exact match of node labels) is equivalent to the inner product of the label distributions. However, this kernel might suffer from diagonal dominance, where most nodes have unique labels and a graph is similar to itself but not to other graphs in the dataset. The shortcoming was addressed in [Yanardag and Vishwanathan, 2015]. The kernel between graphs G and G\u2032 is computed as \u03ba(G, G\u2032) = \u03a6(G)^T M \u03a6(G\u2032) where M is a pre-computed matrix that measures the similarity between labels. The matrix can become huge and the approach is not applicable to large-scale graphs. 
While the Weisfeiler-Lehman kernel applies to ordered neighborhoods, for large graphs it is likely to result in many unique strings, and comparing them with the Dirac kernel might yield poor results, both in terms of accuracy and scalability.\nIn recent work, Manzoor et al. [2016] presented an unsupervised learning algorithm that generates feature vectors from labeled graphs by traversing the neighbor edges in a predefined order. Although not discussed in that paper, the generated vectors correspond to explicit feature maps for convolutional graph kernels with Dirac base kernels. Our approach provides a highly scalable algorithmic framework that allows for different base kernels and different tree traversal methods.\nThe idea of ordered neighborhoods was also used in the context of kernels for general graphs. Martino et al. [2012] proposed to decompose the original graph into directed acyclic graphs (DAGs) and then define an order on the DAG nodes which is then used to generate relevant features. The approach was extended to learning the features from graph streams in Martino et al. [2013]. These approaches are not directly applicable to our setting as they have a different objective, namely to exploit the structure of general graphs. Also, the DAG generation incurs high computational cost.\nAnother line of research related to our work presents algorithms for learning graph vector representations [Perozzi et al., 2014, Grover and Leskovec, 2016, Niepert et al., 2016]. Given a collection of labeled graphs, the goal is to map the graphs (or their nodes) to a feature space that best represents the graph structure. 
These approaches are powerful and yield the state-of-the-art results\nbut they involve the optimization of complex objective functions and do not scale to massive graphs.\n\nContributions The contributions of this paper can be summarized as follows:\n\u2022 To the best of our knowledge, this is the \ufb01rst work to focus and formally de\ufb01ne graph kernels for\ngraphs with ordered node neighborhoods. Extending upon string kernels, we present and formally\nanalyse a family of graph kernels that can be applied to different problems. The KONG algorithms\nare ef\ufb01cient with respect to two parameters, the total number of graphs N and the total number\nof edges M. We propose approaches that compute an explicit feature map for each graph which\nenables the use of linear SVMs for graph classi\ufb01cation, thus avoiding the computation of a kernel\nmatrix of size O(N 2). Leveraging advanced sketching techniques, an approximation of the explicit\nfeature map for a graph with m edges can be computed in time and space O(m) or a total O(M ).\nWe also present an extension to learning from graph streams using sublinear space o(M ).1\n\n1Software implementation and data are available at https://github.com/kutzkov/KONG.\n\n2\n\n\f\u2022 For general labeled graphs without neighbor ordering our approach results in new graph kernels\nthat compare the label distribution of subgraphs using widely used kernels such as the polynomial\nand cosine kernels. We argue that the approach can be seen as an ef\ufb01cient smoothing algorithm for\nnode labelling kernels such as the Weisfeiler-Lehman kernel. An experimental evaluation on real\ngraphs shows that the proposed kernels are competitive with state-of-the-art kernels, achieving\nbetter accuracy on some benchmark datasets and using compact feature maps.\n\n\u2022 The presented approach can be viewed as an ef\ufb01cient algorithm for learning compact graph\nrepresentations. 
The primary focus of the approach is on learning explicit feature maps for a\nclass of base kernels for the convolutional graph kernel. However, the algorithms learn vector\nembeddings that can be used by other machine learning algorithms such as logistic regression,\ndecision trees and neural networks as well as unsupervised methods.\n\nPaper outline The paper is organised as follows. In Section 2 we discuss previous work, provide\nmotivating examples and introduce general concepts and notation. In Section 3 we \ufb01rst give a general\noverview of the approach and discuss string generation and string kernels and then present theoretical\nresults. Experimental evaluation is presented in Section 4. We conclude in Section 5.\n\n2 Preliminaries\n\nNotation and problem formulation The input is a collection G of tuples (Gi, yi) where Gi is a\ngraph and yi is a class. Each graph is de\ufb01ned as G = (V, E, (cid:96), \u03c4 ) where (cid:96) : V \u2192 L is a labelling\nfunction for a discrete set L and \u03c4 de\ufb01nes ordering of node neighborhoods. We consider only node\nlabels but all presented algorithms naturally apply to edge labels as well. The neighborhood of a node\nv \u2208 V is Nv = {u \u2208 V : (v, u) \u2208 E}. The ordering function \u03c4v : Nv \u2192 \u03a0(Nv) de\ufb01nes a \ufb01xed\norder permutation on Nv, where \u03a0(Nv) denotes the set of all permutations of the elements of Nv.\nNote that the order is local, i.e., two nodes can have different orderings for same neighborhood sets.\nKernels, feature maps and linear support vector machines A function \u03ba : X \u00d7 X \u2192 R is\na valid kernel if \u03ba(x, y) = \u03ba(y, x) for x, y \u2208 X and the kernel matrix K \u2208 Rm\u00d7m de\ufb01ned by\nK(i, j) = \u03ba(xi, xj) for any x1, . . . , xm \u2208 X is positive semide\ufb01nite. 
If the function \u03ba(x, y) can be represented as \u03c6(x)^T \u03c6(y) for an explicit feature map \u03c6 : X \u2192 Y where Y is an inner product feature space, then \u03ba is a valid kernel. Also, a linear combination of kernels is a kernel. Thus, if the base kernel is valid, then the convolutional kernel is also valid.\nWe will consider base kernels where X = R^n and \u03c6 : R^n \u2192 R^D. Note that D can be very large or even infinite. The celebrated kernel trick circumvents this limitation by computing the kernel function for all support vectors. But this means that for training one needs to explicitly compute a kernel matrix of size N^2 for N input examples. Also, in large-scale applications, the number of support vectors often grows linearly, and at prediction time one needs to evaluate the kernel function for O(N) support vectors. In contrast, linear support vector machines [Joachims, 2006], where the kernel is the vector inner product, run in time linear in the number of examples, and prediction needs O(D) time. An active area of research has been the design of scalable algorithms that compute a low-dimensional approximation of the explicit feature map z : R^D \u2192 R^d such that d \u226a D and \u03ba(x, y) \u2248 z(\u03c6(x))^T z(\u03c6(y)) [Rahimi and Recht, 2007, Le et al., 2013, Pham and Pagh, 2013].\n\nConvolutional graph kernels Most known graph kernels are instances of the family of convolutional kernels [Haussler, 1999]. In their simplified form, convolutional kernels work by decomposing a given graph G into a set of (possibly overlapping) substructures \u0393(G). For example, \u0393(G) can be the set of 1-hop subtrees rooted at each node. The kernel between two graphs G and H is defined as K(G, H) = \u2211_{g\u2208\u0393(G),h\u2208\u0393(H)} \u03ba(g, h), where \u03ba(g, h) is a base kernel comparing the parts g and h. For example, \u03ba can be the inner product kernel comparing the label distribution of the two subtrees. 
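As a toy illustration of this decomposition (a minimal sketch with hypothetical helper names, not the paper's implementation), the following Python snippet decomposes two small labeled graphs into 1-hop subtrees and uses the inner product of label count vectors as the base kernel:

```python
from collections import Counter

def label_counts(graph, node):
    # Label multiset of the 1-hop subtree rooted at `node`;
    # `graph` maps a node to (label, list of neighbors).
    label, neighbors = graph[node]
    return Counter([label] + [graph[u][0] for u in neighbors])

def base_kernel(c1, c2):
    # Inner product of two label count vectors.
    return sum(c1[t] * c2[t] for t in c1)

def conv_kernel(G, H):
    # Convolutional kernel: sum the base kernel over all pairs of parts.
    return sum(base_kernel(label_counts(G, g), label_counts(H, h))
               for g in G for h in H)

# Two tiny labeled graphs: node -> (label, neighbors).
G = {0: ("A", [1]), 1: ("B", [0])}
H = {0: ("A", [1]), 1: ("C", [0])}
print(conv_kernel(G, H))  # each of the 4 subtree pairs shares one label
```

Replacing `base_kernel` with a Dirac kernel (exact multiset match) would recover the equality-based comparison used by most of the kernels discussed next.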
Known graph kernels differ mainly in the way the graph is decomposed. Notable examples include the random walk kernel [G\u00e4rtner et al., 2003], the shortest path kernel [Borgwardt and Kriegel, 2005], the graphlet kernel [Shervashidze et al., 2009] and the Weisfeiler-Lehman kernel [Shervashidze et al., 2011]. The base kernel is usually the Dirac kernel comparing the parts g and h for equality.\nBuilding upon efficient sketching algorithms, we will compute explicit graph feature maps. More precisely, let \u03c6\u03ba be the explicit feature map of the base kernel \u03ba. An explicit feature map \u03a6\u03ba is defined such that for any two graphs G and H:\n\nK(G, H) = \u2211_{g\u2208\u0393(G),h\u2208\u0393(H)} \u03ba(g, h) = \u2211_{g\u2208\u0393(G),h\u2208\u0393(H)} \u03c6\u03ba(g)^T \u03c6\u03ba(h) = \u03a6\u03ba(G)^T \u03a6\u03ba(H).\n\nWhen clear from the context, we will omit \u03ba and write \u03c6(g) and \u03a6(G) for the explicit maps of the substructure g and the graph G.\n\nString kernels The strings generated from subtree traversal will be compared using string kernels. Let \u03a3* be the set of all strings that can be generated from the alphabet \u03a3, and let \u03a3*_k \u2282 \u03a3* be the set of strings with exactly k characters. Let t \u2291 s denote that the string t is a substring of s, i.e., a nonempty sequence of consecutive characters from s. 
The spectrum string kernel compares the distribution of k-grams between strings s1 and s2:\n\n\u03ba_k(s1, s2) = \u2211_{t\u2208\u03a3*_k} #t(s1) #t(s2)\n\nwhere #t(s) = |{x : x \u2291 s and x = t}|, i.e., the number of occurrences of t in s [Leslie et al., 2002]. The explicit feature map for the spectrum kernel is thus the frequency vector \u03c6(s) \u2208 N^{|\u03a3*_k|} such that \u03c6_i(s) = #t(s) where t is the i-th k-gram in the explicit enumeration of all k-grams.\nWe will consider extensions of the spectrum kernel with the polynomial kernel for p \u2208 N: for a constant c \u2265 0,\n\npoly(s1, s2) = (\u03c6(s1)^T \u03c6(s2) + c)^p.\n\nThis accommodates the cosine kernel cos(s1, s2) when feature vectors are normalized as \u03c6(s)/\u2016\u03c6(s)\u2016.\n\nCount-Sketch and Tensor-Sketch Sketching is an algorithmic tool for the summarization of massive datasets such that key properties of the data are preserved. In order to achieve scalability, we will summarize the k-gram frequency vector distributions. In particular, we will use Count-Sketch [Charikar et al., 2004], which for vectors u, v \u2208 R^d computes sketches z(u), z(v) \u2208 R^b such that z(u)^T z(v) \u2248 u^T v, where b < d controls the approximation quality. A key property is that Count-Sketch is a linear projection of the data, and this will allow us to incrementally generate strings and sketch their k-gram distribution. For the polynomial kernel poly(x, y) = (x^T y + c)^p and x, y \u2208 R^d, the explicit feature map of x and y is their p-level tensor product, i.e., the d^p-dimensional vector formed by taking the product of all subsets of p coordinates of x or y. Hence, computing the explicit feature map and then sketching it using Count-Sketch requires O(d^p) time. 
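To illustrate the Count-Sketch guarantee z(u)^T z(v) \u2248 u^T v on k-gram frequency vectors, here is a minimal single-hash Python sketch (our own simplified variant; the actual algorithm of Charikar et al. uses pairwise-independent hash families rather than per-key seeded generators):

```python
import random
from collections import Counter

def kgrams(s, k):
    # k-gram frequency vector of a string: the spectrum kernel feature map.
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s1, s2, k):
    # Exact spectrum kernel: inner product of k-gram frequency vectors.
    f1, f2 = kgrams(s1, k), kgrams(s2, k)
    return sum(f1[t] * f2[t] for t in f1)

def count_sketch(freq, b, seed=0):
    # Hash every k-gram to one of b buckets with a random sign.
    # The map is linear in `freq`, so sketches of node strings can be
    # added up without ever materializing the strings themselves.
    z = [0.0] * b
    for t, c in freq.items():
        rnd = random.Random(f"{seed}:{t}")  # deterministic per-gram hash
        z[rnd.randrange(b)] += rnd.choice((-1, 1)) * c
    return z

s1, s2 = "ABCDGEFHG", "BCDGEF"
z1 = count_sketch(kgrams(s1, 2), b=64)
z2 = count_sketch(kgrams(s2, 2), b=64)
print(spectrum_kernel(s1, s2, 2), sum(x * y for x, y in zip(z1, z2)))
```

The sketched inner product approximates the exact kernel; a larger b (and averaging over independent seeds) tightens the estimate.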
Instead, using Tensor-Sketch [Pham and Pagh, 2013], we compute a sketch of size b for a p-level tensor product in time O(p(d + b log b)).\n\n3 Main results\n\nIn this section we first describe the proposed algorithm, discuss in detail its components, and then present theoretical approximation guarantees for using sketches to approximate graph kernels.\n\nAlgorithm The proposed algorithm is based on the following key ideas: (a) representation of each node v\u2019s neighborhood by a string S_v using a tree traversal method, and (b) approximating the k-gram frequency vector of string S_v using sketching in a way that does not require storing the string S_v. Given a graph G, for each node v we traverse the subtree rooted at v using the neighbor ordering \u03c4 and generate a string. The subtrees represent the graph decomposition of the convolutional kernel. The algorithm allows for flexibility in choosing different alternatives for the subtree traversal. The generated strings are compared by a string kernel. This string kernel is evaluated by computing an explicit feature map for the string at each node. Scalability is achieved by approximating explicit feature maps using sketching techniques so that the kernel can be approximated within a prescribed approximation error. The sum of the node explicit feature maps is the explicit feature map of the graph G. The algorithm is outlined in Algorithm 1.\n\nTree traversal and string generation There are different options for string construction from each node neighborhood. We present a general class of subgraph traversal algorithms that iteratively collect the node strings from the respective neighborhood.\n\nDefinition 1 Let S^h_v denote the string collected at node v after h iterations. A subgraph traversal algorithm is called a composite string generation traversal (CSGT) if S^h_v is a concatenation of a subset of the strings s^0_v, . . . , s^h_v. Each s^i_v is computed in the i-th iteration and is the concatenation of the strings s^{i-1}_u for u \u2208 N_v, in the order given by \u03c4_v.\n\nAlgorithm 1: EXPLICITGRAPHFEATUREMAP\nInput: Graph G = (V, E, \u2113, \u03c4), depth h, labeling \u2113 : V \u2192 L, base kernel \u03ba\nfor v \u2208 V do\n    Traverse the subgraph T_v rooted at v up to depth h\n    Collect the node labels \u2113(u) : u \u2208 T_v in the order specified by \u03c4_v into a string S_v\n    Sketch the explicit feature map \u03c6\u03ba(S_v) for the base string kernel \u03ba (without storing S_v)\n\u03a6\u03ba(G) \u2190 \u2211_{v\u2208V} \u03c6\u03ba(S_v)\nreturn \u03a6\u03ba(G)\n\nThe above definition essentially says that we can iteratively compute the strings collected at a node v from strings collected at v and v\u2019s neighbors in previous iterations, similarly to the dynamic programming paradigm. As we formally show later, this implies that we will be able to collect all node strings S^h_v by traversing O(m) edges in each iteration, and this is the basis for designing efficient algorithms for computing the explicit feature maps.\nNext we present two examples of CSGT algorithms. The first one is the standard iterative breadth-first search algorithm that for each node v collects in h + 1 lists the labels of all nodes within exactly i hops, for 0 \u2264 i \u2264 h. The strings s^i_v collect the labels of nodes within exactly i hops from v. After h iterations, we concatenate the resulting strings, see Algorithm 2. 
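To make Algorithm 2 concrete, here is a short Python rendering of the breadth-first variant on the toy graph of Figure 1 (the graph encoding and node names are ours; neighbor lists are assumed to already be in \u03c4-order):

```python
def bfs_string(graph, labels, root, h):
    # Algorithm 2, iteratively: s_i concatenates the labels of nodes
    # exactly i hops from `root` (children visited in neighbor order),
    # and the result is the concatenation s_0 s_1 ... s_h.
    frontier, parts = [root], []
    for _ in range(h + 1):
        parts.append("".join(labels[u] for u in frontier))
        frontier = [w for u in frontier for w in graph.get(u, [])]
    return "".join(parts)

# Toy subtree from Figure 1; neighbor lists follow the alphabetical
# label order, and "G2" is the second node carrying label G.
graph = {"v": ["B", "C", "D", "G"], "B": ["E", "F"], "C": ["H"], "D": ["G2"]}
labels = {"v": "A", "B": "B", "C": "C", "D": "D", "E": "E",
          "F": "F", "G": "G", "G2": "G", "H": "H"}
print(bfs_string(graph, labels, "v", 2))  # ABCDGEFHG, as in Figure 1
```

With h = 2 this reproduces the string ABCDGEFHG from Figure 1.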
In the toy example in Figure 1, the string at the node with label A is generated as S^2_v = s^0_v s^1_v s^2_v, resulting in A|BCDG|EFHG (s^0_v = A, s^1_v = BCDG and s^2_v = EFHG).\n\nAlgorithm 2: BREADTH-FIRST SEARCH\nInput: Graph G = (V, E, \u2113, \u03c4), depth h, labeling \u2113 : V \u2192 S\nfor v \u2208 V do\n    s^0_v = \u2113(v)\nfor i = 1 to h do\n    for v \u2208 V do\n        s^i_v = $   //$ is the empty string\n        for u \u2208 \u03c4_v(N_v) do\n            s^i_v \u2190 s^i_v.append(s^{i-1}_u)\nfor v \u2208 V do\n    S^h_v = $\n    for i = 0 to h do\n        S^h_v \u2190 S^h_v.append(s^i_v)\n\nAlgorithm 3: WEISFEILER-LEHMAN\nInput: Graph G = (V, E, \u2113, \u03c4), depth h, labeling \u2113 : V \u2192 S\nfor v \u2208 V do\n    s^0_v = \u2113(v)\nfor i = 1 to h do\n    for v \u2208 V do\n        s^i_v \u2190 \u2113(v)\n        for u \u2208 \u03c4_v(N_v) do\n            s^i_v \u2190 s^i_v.append(s^{i-1}_u)\nfor v \u2208 V do\n    S^h_v \u2190 s^h_v\n\nAnother approach, similar to the WL labeling algorithm [Shervashidze et al., 2011], is to concatenate the neighbor labels in the order given by \u03c4_v for each node v into a new string. In the i-th iteration we set \u2113(v) = s^i_v, i.e., s^i_v becomes v\u2019s new label. We follow the CSGT pattern by setting S^h_v = s^h_v, as evident from Algorithm 3. In our toy example, we have s^0_v = A, s^1_v = ABCD and s^2_v = ABEFCHDGG, generated from the neighbor strings s^1_u, u \u2208 N_v: BEF, CH, DG and G.\n\nString kernels and WL kernel smoothing: After collecting the strings at each node we have to compare them. An obvious choice would be the Dirac kernel, which compares two strings for equality. This would yield poor results for graphs of larger degree where most collected strings will be unique, i.e., the diagonal dominance problem where most graphs are similar only to themselves. 
Instead, we consider extensions of the spectrum kernel [Leslie et al., 2002] comparing the k-gram distributions between strings, as discussed in Section 2.\nSetting k = 1 is equivalent to collecting the node labels disregarding the neighbor order and comparing the label distribution between all node pairs. In particular, consider the following smoothing algorithm for the WL kernel. In the first iteration we generate node strings from neighbor labels and relabel all nodes such that each string becomes a new label. Then, in the next iteration we again generate strings at each node, but instead of comparing them for equality with the Dirac kernel, we compare them with the polynomial or cosine kernels. cos(s1, s2)^p decreases faster with p for dissimilar strings, thus p can be seen as a smoothing parameter.\n\nSketching of k-gram frequency vectors The explicit feature maps for the polynomial kernel for p > 1 can be of very high dimension. A solution is to first collect the strings S^h_v at each node, then incrementally generate k-grams and feed them into a sketching algorithm that computes a compact representation for the explicit feature maps of the polynomial kernel. However, for massive graphs with high average degree, or for a large node label alphabet size, we may end up with prohibitively long unique strings at each node. Using the key property of the incremental string generation approach and a sketching algorithm, which is a linear projection of the original data onto a lower-dimensional space, we will show how to sketch the k-gram distribution vectors without explicitly generating the strings S^h_v. More concretely, we will replace the line s^i_v \u2190 s^i_v.append(s^{i-1}_u) in Algorithms 2 and 3 with a sketching algorithm that will maintain the k-gram distribution of each s^i_v as well as s^i_v\u2019s (k \u2212 1)-prefix and (k \u2212 1)-suffix. 
In this way we will only keep track of newly generated k-grams and add up the sketches of the k-gram distribution of the s^i_v strings computed in previous iterations.\nBefore we present the main result, we show two lemmas that state properties of the incremental string generation approach. Observing that in each iteration we concatenate at most m strings, we obtain the following bound on the number of generated k-grams.\n\nLemma 1 The total number of newly created k-grams during an iteration of CSGT is O(mk).\n\nThe next lemma shows that in order to compute the k-gram distribution vector we do not need to explicitly store each intermediate string s^i_v but only keep track of the substrings that will contribute to new k-grams and s^i_v\u2019s k-gram distribution. This allows us to design efficient algorithms by maintaining sketches for the k-gram distribution of the s^i_v strings.\n\nLemma 2 The k-gram distribution vector of the strings s^i_v at each node v can be updated after an iteration of CSGT from the distribution vectors of the strings s^{i-1}_v and explicitly storing substrings of total length O(mk).\n\nThe above lemma is the basis for the sketching solution we present next. It will allow us to sketch the k-gram distribution vectors at each node and only keep track of the prefixes and suffixes of the strings s^i_u.\nThe following theorem is our main result that accommodates both polynomial and cosine kernels. We define cos^h_k(u, v)^p to be the cosine similarity to the power p between the k-gram distribution vectors collected at nodes u and v after h iterations.\n\nTheorem 1 Let G_1, . . . , G_M be a collection of M graphs, each having at most m edges and n nodes. Let K be either the polynomial or cosine kernel with parameter p and \u02c6K its approximation obtained by using size-b sketches of explicit feature maps. Consider an arbitrary pair of graphs G_i and G_j. 
Let T_{<\u03b1} denote the number of node pairs v_i \u2208 G_i, v_j \u2208 G_j such that cos^h_k(v_i, v_j)^p < \u03b1, and let R be an upper bound on the norm of the k-gram distribution vector at each node.\nThen, we can choose a sketch size b = O((log M + log n) log(1/\u03b4) / (\u03b1^2 \u03b5^2)) such that \u02c6K(G_i, G_j) has an additive error of at most \u03b5(K(G_i, G_j) + R^{2p} \u03b1 T_{<\u03b1}) with probability at least 1 \u2212 \u03b4, for \u03b5, \u03b4 \u2208 (0, 1). A graph sketch can be computed in time O(mkph + npb log b) and space O(nb).\n\nNote that for the cosine kernel it holds that R = 1. Assuming that p, k and h are small constants, the running time per graph is linear and the space complexity is sublinear in the number of edges. The approximation error bounds are for the general worst case and can be better for skewed distributions; this follows directly from the properties of the original Count-Sketch algorithm [Charikar et al., 2004].\n\nGraph streams We can extend the above algorithms to work in the semi-streaming graph model [Feigenbaum et al., 2005] where we can afford O(n polylog(n)) space. Essentially, we can store a compact sketch per node but we cannot afford to store all edges. We sketch the k-gram distribution vectors at each node v in h passes. In the i-th pass, we sketch the distribution of s^i_v from the sketches s^{i-1}_u for u \u2208 N_v and the newly computed k-grams. We obtain the following result:\n\nTheorem 2 Let E be a stream of labeled edges arriving in arbitrary order, each edge e_i belonging to one of M graphs over N different nodes. We can compute a sketch of each graph G_i in h passes over the edges by storing a sketch of size b per node, using O(N b) space in time O(|E|hkp + b log b).\n\nThe above result implies that we can sketch real-time graph streams in a single pass over the data, i.e. h = 1. 
In particular, for constants k and p we can compute explicit feature maps of dimension b for the convolutional kernel for real-time streams, for the polynomial and cosine kernels over 1-hop neighborhoods with parameter p, in time O(|E| + N b log b) using O(N b) space.\n\n4 Experiments\n\nIn this section we present our evaluation of the classification accuracy and computation speed of our algorithm and a comparison with other kernel-based algorithms using a set of real-world graph datasets. We first present an evaluation for general graphs without ordering of node neighborhoods, which demonstrates that our algorithm achieves comparable and in some cases better classification accuracy than state-of-the-art kernel-based approaches. We then present an evaluation for graphs with ordered neighborhoods that demonstrates that accounting for neighborhood ordering can lead to more accurate classification, as well as the scalability of our algorithm.\nAll algorithms were implemented in Python 3 and experiments performed on a Windows 10 laptop with an Intel i7 2.9 GHz CPU and 16 GB main memory. For the TensorSketch implementation, we used random numbers from the Marsaglia Random Number CDROM [mar]. We used Python\u2019s scikit-learn implementation [Pedregosa et al., 2011] of the LIBLINEAR algorithm for linear support vector classification [Fan et al., 2008].\nFor comparison with other kernel-based methods, we implemented the explicit map versions of the Weisfeiler-Lehman kernel (WL) [Shervashidze et al., 2011], the shortest path kernel (SP) [Borgwardt and Kriegel, 2005] and the k-walk kernel (KW) [Kriege et al., 2014].\n\nGeneral graphs We evaluated the algorithms on widely-used benchmark datasets from various domains [Kersting et al., 2016]. 
MUTAG [Debnath et al., 1991], ENZYMES [Schomburg et al., 2004], PTC [Helma et al., 2001], Proteins [Borgwardt et al., 2005] and NCI1 [Wale and Karypis, 2006] represent molecular structures, and MSRC [Neumann et al., 2016] represents semantic image processing graphs. Similar to previous works [Niepert et al., 2016, Yanardag and Vishwanathan, 2015], we choose the optimal number of hops h = 2 for the WL kernel and k \u2208 {5, 6} for the k-walk kernel. We performed 10-fold cross-validation using 9 folds for training and 1 fold for testing. The optimal regularization parameter C for each dataset was selected from {0.1, 1, 10, 100}. We ran the algorithms on 30 random permutations of the neighbor node lists and report the average accuracy and the average standard deviation. We set the subtree depth parameter h to 2; in the first setting we used the original graph labels, and in the second setting we obtained new labels using one iteration of WL. If the explicit feature maps for the cosine and polynomial kernel have dimensionality greater than 5,000, we sketched the maps using TensorSketch with a sketch size of 5,000.\nThe results are presented in Table 1. In brackets we give the parameters for which we obtain the optimal value: the kernel (cosine or polynomial, with or without relabeling) and the power p \u2208 {1, 2, 3, 4} of the polynomial (e.g. poly-rlb-1 denotes the polynomial kernel with relabeling and p = 1). We see that among the four algorithms, KONG achieves the best or second-best results. 
We would like to note that the methods are likely to admit further improvements by learning data-specific string generation algorithms, but such considerations are beyond the scope of the paper.

Dataset     KW            SP            WL            KONG
Mutag       83.7 ± 1.2    84.7 ± 1.3    84.9 ± 2.1    87.8 ± 0.7 (poly-rlb-1)
Enzymes     34.8 ± 0.7    39.6 ± 0.8    52.9 ± 1.1    50.1 ± 1.1 (cosine-rlb-2)
PTC         57.7 ± 1.1    59.1 ± 1.3    62.4 ± 1.2    63.7 ± 0.8 (cosine-2)
Proteins    70.9 ± 0.4    72.7 ± 0.5    71.4 ± 0.7    73.0 ± 0.6 (cosine-rlb-1)
NCI1        74.1 ± 0.3    73.3 ± 0.3    81.4 ± 0.3    76.4 ± 0.3 (cosine-rlb-1)
MSRC        92.9 ± 0.8    91.2 ± 0.9    91.0 ± 0.7    95.2 ± 1.3 (poly-1)

Table 1: Classification accuracies for general labeled graphs (the 1-gram case).

Graphs with ordered neighborhoods We performed experiments on three datasets of graphs with ordered neighborhoods (defined by the creation time of edges). The first dataset was presented in [Manzoor et al., 2016] and consists of 600 web browsing graphs from six different classes, with over 89.77M edges and 5.04M nodes. We generated the second graph dataset from the popular MovieLens dataset [mov] as follows. We created a bipartite graph with nodes corresponding to users and movies, and edges connecting a user to a movie if the user has rated the movie. The users are labeled into four categories according to age, and movies are labeled with a genre, for a total of 19 genres. We considered only movies with a single genre. For each user we created a subgraph from its 2-hop neighborhood and set its class to be the user's gender. We generated 1,700 graphs for each gender. The total number of edges is about 99.09M, over 14.3M nodes.

Figure 2: Comparison of classification accuracy for graphs with ordered neighborhoods.
The third graph dataset was created from Dunnhumby's retailer dataset [dun]. Similarly to the MovieLens dataset, we created a bipartite graph of customers and products where edges represent purchases. Users are labeled with four categories according to their affluence, and products belong to one of nine categories. Transactions are ordered by timestamp, and products within the same transaction are ordered alphabetically. The total number of graphs is 1,565, with over 257K edges and 244K nodes. There are 7 classes corresponding to the user's life stage. The classes have an unbalanced distribution, and we optimized the classifier to distinguish between a class with frequency 0.0945% and all other classes. The optimal C-value for the SVM optimization was selected from 10^i for i ∈ {−1, 0, 1, 2, 3, 4}.

Results The average classification accuracies over 1,000 runs of the different methods, for different training-test splits, are shown in Figure 2. We exclude the SP kernel from the figure because either its running time was infeasible or its results were much worse than those of the other methods. For all datasets, the k-walk kernel obtained its best results for k = 1, corresponding to collecting the labels of the endpoints of edges. We set h = 1 for both WL and KONG, and obtained the best results for the cosine kernel with p = 1. The methods compared use 2-grams with ordered neighborhoods and with shuffled neighborhoods, the latter removing the information about the order of edges; we also compare with using only 1-grams. Overall, we observe that accounting for the information about the order of neighborhoods can improve classification accuracy by a significant margin.
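The distinction between the ordered and shuffled variants can be illustrated with a small sketch of k-gram extraction from a node's ordered neighbor-label sequence; the function name is ours, and this is a simplified stand-in for KONG's string generation, not the authors' implementation:

```python
from collections import Counter

def kgram_features(neighbor_labels, k):
    """Bag of k-grams over a node's ordered neighbor-label sequence.

    For k = 1 this is just the label distribution of the neighborhood,
    which is invariant to neighbor order; for k >= 2 consecutive labels
    are joined, so the features depend on the order in which edges were
    created. Shuffling the neighborhood therefore changes the 2-gram
    features but not the 1-gram features.
    """
    return Counter(tuple(neighbor_labels[i:i + k])
                   for i in range(len(neighbor_labels) - k + 1))
```

For example, the ordered sequence ['a', 'b', 'a', 'c'] yields the 2-grams (a,b), (b,a), (a,c), while a shuffled copy of it yields a different 2-gram bag even though the 1-gram bags coincide, which is exactly the information the shuffled baseline discards.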
We provide further results in Table 2 for a training set size of 80%, showing also the dimension D of the explicit feature map, the computation time (the first value is the time to compute the explicit feature maps, the second the SVM training time), and the accuracy and AUC metrics. We observe that the WL kernel can generate very long strings, i.e., explicit feature maps of large dimension, which not only lead to the diagonal dominance problem but also result in large computation time; our method controls this by using k-grams.

Web browsing:
Method         D        Time           Accuracy   AUC
SP             −        > 24 hrs       −          −
KW             82       665" / 116"    99.80      99.94
WL             20,359   48" / 576"     99.92      99.99
K-1            34       206" / 79"     99.88      99.97
K-2 shuffled   264      220" / 255"    99.81      99.94
K-2            203      217" / 249"    99.95      99.99

MovieLens:
Method         D        Time           Accuracy   AUC
SP             −        > 24 hrs       −          −
KW             136      120" / 420"    66.98      73.65
WL             > 2M     492" / −       −          −
K-1            21       509" / 197"    65.83      71.00
K-2 shuffled   326      592" / 497"    67.01      73.31
K-2            326      589" / 613"    67.68      73.20

Dunnhumby:
Method         D        Time           Accuracy   AUC
SP             228      144" / 74"     90.61      50.1
KW             56       0.7" / 134"    90.57      58.47
WL             2,491    22" / 230"     90.52      57.80
K-1            13       42" / 25"      90.53      59.33
K-2 shuffled   85       48" / 131"     90.57      61.07
K-2            82       46" / 133"     90.56      61.94

Table 2: Comparison of the accuracy and speed of different methods for graphs with ordered neighborhoods; we use the notation K-k to denote KONG using k-grams; the time column shows explicit map computation time / SVM classification time.

5 Conclusions

We presented an efficient algorithmic framework, KONG, for learning graph kernels for graphs with ordered neighborhoods.
We demonstrated the applicability of the approach and obtained performance benefits for graph classification tasks over other kernel-based approaches.

There are several directions for future research. An interesting research question is to explore how much graph classification can be improved by using domain-specific neighbor orderings. Another direction is to obtain efficient algorithms that can generate explicit graph feature maps but compare the node strings with more complex base string kernels, such as mismatch or string alignment kernels.

Acknowledgements. The work has been supported by a research collaboration grant funded by Huawei Technologies.

References

Dunnhumby dataset. URL https://www.dunnhumby.com/sourcefiles.

Marsaglia Random Number CD-ROM. URL https://web.archive.org/web/20160119150146/http://stat.fsu.edu/pub/diehard/cdrom/.

MovieLens dataset. URL https://grouplens.org/datasets/movielens/.

Karsten M. Borgwardt and Hans-Peter Kriegel. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), pages 74–81, 2005.

Karsten M. Borgwardt, Cheng Soon Ong, Stefan Schönauer, S. V. N. Vishwanathan, Alexander J. Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. In Proceedings of the Thirteenth International Conference on Intelligent Systems for Molecular Biology, Detroit, MI, USA, 25-29 June 2005, pages 47–56, 2005.

Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3–15, 2004.

A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J. Med.
Chem., 34:786–797, 1991.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348(2-3):207–216, 2005.

Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

Thomas Gärtner, Peter A. Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. In 16th Annual Conference on Computational Learning Theory, COLT 2003, pages 129–143, 2003.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.

David Haussler. Convolution kernels on discrete structures. Technical report, 1999.

C. Helma, R. D. King, S. Kramer, and A. Srinivasan. The predictive toxicology challenge 2000–2001. Bioinformatics, 17:107–108, 2001.

Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226, 2006.

Kristian Kersting, Nils M. Kriege, Christopher Morris, Petra Mutzel, and Marion Neumann. Benchmark data sets for graph kernels, 2016. URL http://graphkernels.cs.tu-dortmund.de.

Nils Kriege, Marion Neumann, Kristian Kersting, and Petra Mutzel. Explicit versus implicit graph feature maps: A computational phase transition for walk kernels. In 2014 IEEE International Conference on Data Mining, ICDM 2014, pages 881–886, 2014.

Nils M. Kriege, Marion Neumann, Christopher Morris, Kristian Kersting, and Petra Mutzel.
A unifying view of explicit and implicit feature maps for structured data: Systematic studies of graph kernels. CoRR, abs/1703.00676, 2017. URL http://arxiv.org/abs/1703.00676.

Quoc V. Le, Tamás Sarlós, and Alexander J. Smola. Fastfood - computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pages 244–252, 2013.

C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, volume 7, pages 566–575, 2002.

Emaad A. Manzoor, Sadegh M. Milajerdi, and Leman Akoglu. Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1035–1044, 2016.

Giovanni Da San Martino, Nicolò Navarin, and Alessandro Sperduti. A tree-based kernel for graphs. In Proceedings of the Twelfth SIAM International Conference on Data Mining, Anaheim, California, USA, April 26-28, 2012, pages 975–986, 2012.

Giovanni Da San Martino, Nicolò Navarin, and Alessandro Sperduti. A lossy counting based approach for learning on streams of graphs on a budget. In IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013, pages 1294–1301, 2013.

Marion Neumann, Roman Garnett, Christian Bauckhage, and Kristian Kersting. Propagation kernels: efficient graph kernels from propagated information. Machine Learning, 102(2):209–245, 2016.

Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, pages 2014–2023, 2016.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.
Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: online learning of social representations. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 701–710, 2014.

Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pages 239–247, 2013.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, 2007, pages 1177–1184, 2007.

Jean-François Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa, Amélie Dricot, Ning Li, Gabriel F. Berriz, Francis D. Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou, Niels Klitgord, Christophe Simon, Mike Boxem, Stuart Milstein, Jennifer Rosenberg, Debra S. Goldberg, Lan V. Zhang, Sharyl L. Wong, Giovanni Franklin, Siming Li, Joanna S. Albala, Janghoo Lim, Carlene Fraughton, Estelle Llamosas, Sebiha Cevik, Camille Bex, Philippe Lamesch, Robert S. Sikorski, Jean Vandenhaute, Huda Y. Zoghbi, Alex Smolyar, Stephanie Bosak, Reynaldo Sequerra, Lynn Doucette-Stamm, Michael E. Cusick, David E. Hill, Frederick P. Roth, and Marc Vidal. Towards a proteome-scale map of the human protein–protein interaction network. Nature, 437:1173–1178, 2005. URL http://dx.doi.org/10.1038/nature04209.

I. Schomburg, A. Chang, C. Ebeling, M. Gremse, C. Heldt, G. Huhn, and D. Schomburg. BRENDA, the enzyme database: updates and major new developments.
Nucleic Acids Research, 32:D431–D433, 2004.

Nino Shervashidze, S. V. N. Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M. Borgwardt. Efficient graphlet kernels for large graph comparison. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, pages 488–495, 2009.

Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.

Nikil Wale and George Karypis. Comparison of descriptor spaces for chemical compound retrieval and classification. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM 2006), 18-22 December 2006, Hong Kong, China, pages 678–689, 2006.

Pinar Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374, 2015.