{"title": "Fast Kernels for String and Tree Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 592, "abstract": null, "full_text": "Fast Kernels for String and Tree Matching \n\nS. V. N. Vishwanathan \n\nDept. of Compo Sci. & Automation \n\nIndian Institute of Science \nBangalore, 560012, India \n\nvishy@csa . iisc . ernet . in \n\nAlexander J. Smola \n\nMachine Learning Group, RSISE \n\nAustralian National University \nCanberra, ACT 0200, Australia \nAlex . Smola@anu . edu . au \n\nAbstract \n\nIn this paper we present a new algorithm suitable for matching discrete \nobjects such as strings and trees in linear time, thus obviating dynarrtic \nprogramming with quadratic time complexity. Furthermore, prediction \ncost in many cases can be reduced to linear cost in the length of the se(cid:173)\nquence to be classified, regardless of the number of support vectors. This \nimprovement on the currently available algorithms makes string kernels \na viable alternative for the practitioner. \n\n1 Introduction \nMany problems in machine learning require the classifier to work with a set of discrete ex(cid:173)\namples. Common examples include biological sequence analysis where data is represented \nas strings [4] and Natural Language Processing (NLP) where the data is in the form a parse \ntree [3]. In order to apply kernel methods one defines a measure of similarity between \ndiscrete structures via a feature map \u00a2 : X ----+ Jek. \nHere X is the set of discrete structures (eg. the set of all parse trees of a language) and JeK \nis a Hilbert space. Furthermore, dot products then lead to kernels \n\nk(x, x') = (\u00a2(x ), \u00a2(X') ) \n\n(1) \n\nwhere x, x' E X. The success of a kernel method employing k depends both on the faithful \nrepresentation of discrete data and an efficient means of computing k. \nThis paper presents a means of computing kernels on strings [15, 7, 12] and trees [3] in \nlinear time in the size of the arguments, regardless of the weighting that is associated with \nany of the terms, plus linear time complexity for prediction, regardless of the number of \nsupport vectors. This is a significant improvement, since the so-far fastest methods [8, 3] \nrely on dynarrtic programming which incurs a quadratic cost in the length of the argument. \nNote that the method we present here is far more general than strings and trees, and it can \nbe applied to finite state machines, formal languages, automata, etc. to define new kernels \n[14]. However for the scope of the current paper we Iirrtit ourselves to a fast means of \ncomputing extensions of the kernels of [15, 3, 12]. \nIn a nutshell our idea works as follows: \nI: iE I \u00a2i (x )\u00a2i (x') , where the index set I may be large, yet the number of nonzero en(cid:173)\ntries is small in comparison to III- Then an efficient way of computing k is to sort the set \nof nonzero entries \u00a2(x) and \u00a2(X') beforehand and count only matching non-zeros. This \nis similar to the dot-product of sparse vectors in numerical mathematics. As long as the \nsorting is done in an intelligent manner, the cost of computing k is linear in the sum of \nnon-zeros entries combined. In order to use this idea for matching strings (which have a \n\nassume we have a kernel k(x, x') \n\n\fquadratically increasing number of substrings) and trees (which can be transformed into \nstrings) efficient sorting is realized by the compression of the set of all substrings into a \nsuffix tree. 
2 String Kernels

We begin by introducing some notation. Let $A$ be a finite set which we call the alphabet. The elements of $A$ are characters. Let $\$$ be a sentinel character such that $\$ \notin A$. Any $x \in A^k$ for $k = 0, 1, 2, \ldots$ is called a string. The empty string is denoted by $\epsilon$, and $A^*$ represents the set of all non-empty strings defined over the alphabet $A$.

In the following we will use $s, t, u, v, w, x, y, z \in A^*$ to denote strings and $a, b, c \in A$ to denote characters. $|x|$ denotes the length of $x$, $uv \in A^*$ the concatenation of two strings $u, v$, and $au$ the concatenation of a character and a string. We use $x[i : j]$ with $1 \le i \le j \le |x|$ to denote the substring of $x$ between locations $i$ and $j$ (both inclusive). If $x = uvw$ for some (possibly empty) $u, v, w$, then $u$ is called a prefix of $x$, $v$ is called a substring (also denoted by $v \sqsubseteq x$), and $w$ is called a suffix of $x$. Finally, $\mathrm{num}_y(x)$ denotes the number of occurrences of $y$ in $x$. The type of kernels we will be studying are defined by

$$k(x, x') := \sum_{s \sqsubseteq x,\; s' \sqsubseteq x'} w_s \,\delta_{s, s'} = \sum_{s \in A^*} \mathrm{num}_s(x)\, \mathrm{num}_s(x')\, w_s. \quad (2)$$

That is, we count the number of occurrences of every string $s$ in both $x$ and $x'$ and weight it by $w_s$, where the latter may be a weight chosen a priori or after seeing data, e.g., for inverse document frequency counting [11]. This includes a large number of special cases, each of which amounts to a particular choice of weight function (see the code examples after this list):

• Setting $w_s = 0$ for all $|s| > 1$ yields the bag-of-characters kernel, which simply counts single characters.

• The bag-of-words kernel is generated by requiring $s$ to be bounded by whitespace.

• Setting $w_s = 0$ for all $|s| > n$ yields limited range correlations of length $n$.

• The $k$-spectrum kernel takes into account substrings of length $k$ [12]. It is achieved by setting $w_s = 0$ for all $|s| \neq k$.

• TFIDF weights are achieved by first creating a (compressed) list of all $s$, including frequencies of occurrence, and subsequently rescaling $w_s$ accordingly.
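Using the `sparse_kernel` sketch from Section 1, several of these special cases are just weight functions plugged into the `w` argument. The helper names below are our own, not the paper's:

```python
# w_s = 0 for |s| > 1: the bag-of-characters kernel
bag_of_characters = lambda s: 1.0 if len(s) == 1 else 0.0

# w_s = 0 for |s| > n: limited range correlations of length n
def limited_range(n):
    return lambda s: 1.0 if len(s) <= n else 0.0

# w_s = 0 for |s| != k: the k-spectrum kernel of [12]
def k_spectrum(k):
    return lambda s: 1.0 if len(s) == k else 0.0

# e.g. the number of shared 2-mers, counted with multiplicity:
# sparse_kernel("abab", "babb", w=k_spectrum(2)) == 3.0
```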
All these kernels can be computed efficiently via the construction of suffix trees, as we will see in the following sections. However, before we do so, let us turn to trees. The latter are important for two reasons: first, the suffix tree representation of a string will be used to compute kernels efficiently, and second, we may wish to compute kernels on trees, which will be carried out by reducing trees to strings and then applying a string kernel.

3 Tree Kernels

A tree is defined as a connected directed graph with no cycles. A node with no children is referred to as a leaf. A subtree rooted at node $n$ is denoted by $T_n$, and $t \sqsubseteq T$ is used to indicate that $t$ is a subtree of $T$. If a set of nodes in the tree, along with the corresponding edges, forms a tree, then we define it to be a subset tree. If every node $n$ of the tree contains a label, denoted by $\mathrm{label}(n)$, then the tree is called a labeled tree. If only the leaf nodes contain labels, then the tree is called a leaf-labeled tree. Kernels on trees can be defined by defining kernels on matching subset trees, as proposed by [3], or (more restrictively) by defining kernels on matching subtrees. In the latter case we have

$$k(T, T') = \sum_{t \sqsubseteq T,\; t' \sqsubseteq T'} w_t \,\delta_{t, t'}. \quad (3)$$

Ordering Trees  An ordered tree is one in which the child nodes of every node are ordered as per the ordering defined on the node labels. Unless there is a specific inherent order on the trees we are given (which is, e.g., the case for parse trees), the representation of trees is not unique. For instance, the two unlabeled trees of Figure 1 are equivalent and can be obtained from each other by reordering the nodes.

[Figure 1: Two equivalent trees.]

To order trees we assume that a lexicographic order is associated with the labels, if they exist. Furthermore, we assume that the additional symbols '[', ']' satisfy '[' < ']', and that ']', '[' < label(n) for all labels. We will use these symbols to define tags for each node as follows:

• For an unlabeled leaf $n$ define tag(n) := [].

• For a labeled leaf $n$ define tag(n) := [label(n)].

• For an unlabeled node $n$ with children $n_1, \ldots, n_c$, sort the tags of the children in lexicographical order such that tag($n_i$) $\le$ tag($n_j$) if $i < j$, and define tag(n) = [tag($n_1$) tag($n_2$) ... tag($n_c$)].

• For a labeled node perform the same operations as above and set tag(n) = [label(n) tag($n_1$) tag($n_2$) ... tag($n_c$)].

For instance, the root nodes of both trees depicted above would be encoded as [[][[][]]]. We now prove that the tag of the root node is, indeed, a unique identifier and that it can be constructed in log-linear time.

Theorem 1  Denote by $T$ a binary tree with $l$ nodes and let $\lambda$ be the maximum length of a label. Then the following properties hold for the tag of the root node:

1. tag(root) can be computed in $(\lambda + 2)(l \log_2 l)$ time and linear storage in $l$.

2. Substrings $s$ of tag(root) starting with '[' and ending with a balanced ']' correspond to subtrees $T'$ of $T$, where $s$ is the tag of $T'$.

3. Arbitrary substrings $s$ of tag(root) correspond to subset trees $T'$ of $T$.

4. tag(root) is invariant under permutations of the leaves and allows the reconstruction of a unique element of the equivalence class (under permutation).

Proof  We prove claim 1 by induction. The tag of a leaf can be constructed in constant time by storing [, ], and a pointer to the label of the leaf (if it exists), that is, in 3 operations. Next assume that we are at node $n$ with children $n_1, n_2$. Let $T_n$ contain $l_n$ nodes, and let $T_{n_1}$ and $T_{n_2}$ contain $l_1$ and $l_2$ nodes respectively. By our induction assumption we can construct the tags for $n_1$ and $n_2$ in $(\lambda + 2)(l_1 \log_2 l_1)$ and $(\lambda + 2)(l_2 \log_2 l_2)$ time respectively. Comparing the tags of $n_1$ and $n_2$ costs at most $(\lambda + 2) \min(l_1, l_2)$ operations, and the tag itself can be constructed in constant time and linear space by manipulating pointers. Without loss of generality we assume that $l_1 \le l_2$. Thus, the time required to construct tag(n) (normalized by $\lambda + 2$) is

$$l_1 (\log_2 l_1 + 1) + l_2 \log_2 l_2 = l_1 \log_2 (2 l_1) + l_2 \log_2 l_2 \le l_n \log_2 l_n. \quad (4)$$

One way of visualizing our ordering is to imagine that we perform a DFS (depth first search) on the tree $T$ and emit a '[', followed by the label of the node, when we visit a node for the first time, and a ']' when we leave a node for the last time. It is clear that a balanced substring $s$ of tag(root) is emitted only when the corresponding DFS on $T'$ is completed. This proves claim 2.

We can emit a substring of tag(root) only if we can perform a DFS on the corresponding set of nodes. This implies that these nodes constitute a tree and hence by definition are subset trees of $T$. This proves claim 3.

Since leaf nodes do not have children, their tag is clearly invariant under permutation. For an internal node we perform lexicographic sorting on the tags of its children, which removes any dependence on permutations. This proves the invariance of tag(root) under permutations of the leaves. Concerning the reconstruction, we proceed as follows: each tag of a subtree starts with '[' and ends in a balanced ']', hence we can strip the first [] pair from the tag, take whatever is left outside brackets as the label of the root node, and repeat the procedure with the balanced [...] entries for the children of the root node. This constructs a tree with the same tag as tag(root), thus proving claim 4. ∎
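A direct transcription of the tagging rules into Python may help; the pair-of-(label, children) tree encoding below is our own convenience, not from the paper, and Python's default string order has '[' < ']', matching the convention above:

```python
def tag(node):
    """Canonical tag of Theorem 1: sort the children's tags
    lexicographically, then wrap them (and the label, if any) in brackets."""
    label, children = node          # node = (label_or_None, [child, ...])
    inner = "".join(sorted(tag(c) for c in children))
    return "[" + (label or "") + inner + "]"

# The two equivalent unlabeled trees of Figure 1 receive identical tags:
leaf = (None, [])
t1 = (None, [leaf, (None, [leaf, leaf])])
t2 = (None, [(None, [leaf, leaf]), leaf])
assert tag(t1) == tag(t2)
```

The sort at every internal node, combined with the pointer tricks described in the proof, is what yields the log-linear bound of claim 1.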
An extension to trees of degree $d$ is straightforward (the cost increases to $d \log_2 d$ times the original cost), yet the proof, in particular (4), becomes more technical without providing additional insight, hence we omit this generalization for brevity.

Corollary 2  Kernels on trees $T, T'$ can be computed via string kernels if we use tag($T$), tag($T'$) as strings. If we require that only balanced [...] substrings have nonzero weight $w_s$, then we obtain the subtree matching kernel defined in (3).

This reduces the problem of tree kernels to string kernels, and all we need to show in the following is how the latter can be computed efficiently. For this purpose we need to introduce suffix trees.

4 Suffix Trees and Matching Statistics

Definition  The suffix tree is a compacted trie that stores all suffixes of a given text string. We denote the suffix tree of the string $x$ by $S(x)$. Moreover, let nodes($S(x)$) be the set of all nodes of $S(x)$ and let root($S(x)$) be the root of $S(x)$. For a node $w$, father($w$) denotes its parent, $T(w)$ denotes the subtree rooted at the node, lvs($w$) denotes the number of leaves in that subtree, and path($w$) := $w$ is the path from the root to the node. That is, we use the path $w$ from root to node as the label of the node $w$.

[Figure 2: Suffix tree of ababc.]

We denote by words($S(x)$) the set of all strings $w$ such that $wu \in$ nodes($S(x)$) for some (possibly empty) string $u$, which means that words($S(x)$) is the set of all possible substrings of $x$. For every $t \in$ words($S(x)$) we define ceil($t$) as the node $w$ such that $w = tu$ and $u$ is the shortest (possibly empty) substring such that $w \in$ nodes($S(x)$). Similarly, for every $t \in$ words($S(x)$) we define floor($t$) as the node $w$ such that $t = wu$ and $u$ is the shortest (possibly empty) substring such that $w \in$ nodes($S(x)$). Given a string $t$ and a suffix tree $S(x)$, we can decide whether $t \in$ words($S(x)$) in $O(|t|)$ time by simply walking down the corresponding edges of $S(x)$.

If the sentinel character $\$$ is appended to the string $x$, then it can be shown that for any $t \in$ words($S(x)$), lvs(ceil($t$)) gives us the number of occurrences of $t$ in $x$ [5]. The idea works as follows: all suffixes of $x$ starting with $t$ have to pass through ceil($t$), hence we simply have to count the occurrences of the sentinel character, which can be found only in the leaves. Note that a simple depth first search (DFS) of $S(x)$ enables us to calculate lvs($w$) for each node of $S(x)$ in $O(|x|)$ time and space.
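For illustration, the interface of $S(x)$ can be mimicked with an uncompacted suffix trie: quadratic size rather than the linear size of a proper suffix tree (e.g., Ukkonen's construction [13]), but with the same lvs-based occurrence counting. A minimal sketch with hypothetical names:

```python
class Node:
    def __init__(self):
        self.children = {}
        self.lvs = 0    # number of leaves below this node

def suffix_trie(x):
    """Uncompacted suffix trie of x$: a quadratic-size stand-in for S(x)."""
    root = Node()
    for i in range(len(x) + 1):            # insert every suffix of x$
        node = root
        for ch in x[i:] + "$":
            node = node.children.setdefault(ch, Node())
    def count(node):                        # the DFS mentioned above
        node.lvs = 1 if not node.children else sum(
            count(c) for c in node.children.values())
        return node.lvs
    count(root)
    return root

def num_occurrences(root, t):
    """lvs(ceil(t)) = number of occurrences of t in x (zero if t is absent)."""
    node = root
    for ch in t:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.lvs

assert num_occurrences(suffix_trie("ababc"), "ab") == 2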
Let $aw$ be a node of $S(x)$, and let $v$ be the longest suffix of $w$ such that $v \in$ nodes($S(x)$). An unlabeled edge $aw \to v$ is called a suffix link in $S(x)$. A suffix link of the form $aw \to w$ is called atomic. It can be shown that all the suffix links in a suffix tree are atomic [5, Proposition 2.9]. We add suffix links to $S(x)$ to allow us to perform efficient string matching: suppose we have found that $aw$ is a substring of $x$ by parsing the suffix tree $S(x)$. Clearly $w$ is then also a substring of $x$. If we want to locate the node corresponding to $w$, it would be wasteful to parse the tree again; suffix links let us locate this node in constant time. Suffix tree construction algorithms exploit this property of suffix links to achieve linear running time; the algorithm of [13] constructs the suffix tree and all such suffix links in linear time.

Matching Statistics  Given strings $x, y$ with $|x| = n$ and $|y| = m$, the matching statistics of $x$ with respect to $y$ are defined by vectors $v, c$ of length $n$, where $v_i$ is the length of the longest substring of $y$ matching a prefix of $x[i : n]$, $\bar{v}_i := i + v_i - 1$, $c_i$ is a pointer to ceil($x[i : \bar{v}_i]$), and $\bar{c}_i$ is a pointer to floor($x[i : \bar{v}_i]$) in $S(y)$. For an example see Table 1.

Table 1: Matching statistics of abba with respect to S(ababc).

  i                     1     2     3       4
  x_i                   a     b     b       a
  v_i                   2     1     2       1
  ceil(x[i : v̄_i])      ab    b     babc$   ab

For a given $y$ one can construct $v, c$ corresponding to $x$ in linear time. The key observation is that $v_{i+1} \ge v_i - 1$, since if $x[i : \bar{v}_i]$ is a substring of $y$ then certainly $x[i+1 : \bar{v}_i]$ is also a substring of $y$. Moreover, the matching substring in $y$ that we find must have $x[i+1 : \bar{v}_i]$ as a prefix. The matching statistics algorithm of [2] exploits this observation and uses it to cleverly walk down the suffix links of $S(y)$ in order to compute the matching statistics in $O(|x|)$ time.

More specifically, the algorithm works by maintaining a pointer $p_i = \mathrm{floor}(x[i : \bar{v}_i])$. It then finds $p_{i+1} = \mathrm{floor}(x[i+1 : \bar{v}_i])$ by first walking down the suffix link of $p_i$ and then walking down the edges corresponding to the remaining portion of $x[i+1 : \bar{v}_i]$ until it reaches $\mathrm{floor}(x[i+1 : \bar{v}_i])$. Now $v_{i+1}$ can be found easily by walking from $p_{i+1}$ along the edges of $S(y)$ that match the string $x[i+1 : n]$, until we can go no further. The value of $v_1$ is found by simply walking down $S(y)$ to find the longest prefix of $x$ which matches a substring of $y$.

Matching Substrings  Using $v$ and $c$ we can read off the number of matching substrings of $x$ and $y$. The useful observation here is that the only substrings which occur in both $x$ and $y$ are those which are prefixes of $x[i : \bar{v}_i]$. The number of occurrences of a substring in $y$ can be found via lvs(ceil($w$)) (see Section 4). The two lemmas below formalize this.

Lemma 3  $w$ is a substring of $x$ iff there is an $i$ such that $w$ is a prefix of $x[i : n]$. The number of occurrences of $w$ in $x$ can be calculated by finding all such $i$.

Lemma 4  The set of matching substrings of $x$ and $y$ is the set of all prefixes of $x[i : \bar{v}_i]$.

Proof  Let $w$ be a substring of both $x$ and $y$. By the above lemma there is an $i$ such that $w$ is a prefix of $x[i : n]$. Since $v_i$ is the length of the maximal prefix of $x[i : n]$ which is a substring of $y$, it follows that $v_i \ge |w|$. Hence $w$ must be a prefix of $x[i : \bar{v}_i]$. ∎
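A brute-force version of the matching statistics vector $v$ is easy to write and useful as a reference against the linear-time suffix-link walk of [2]. The sketch below is polynomial-time and uses our own naming and 0-based indexing:

```python
def matching_statistics(x, y):
    """v[i] = length of the longest prefix of x[i:] that occurs in y
    (0-indexed here, unlike the 1-indexed text). Naive and polynomial;
    [2] computes the same vector in O(|x|) via the suffix links of S(y)."""
    subs = {y[i:j] for i in range(len(y)) for j in range(i + 1, len(y) + 1)}
    v = []
    for i in range(len(x)):
        k = 0
        while i + k < len(x) and x[i:i + k + 1] in subs:
            k += 1
        v.append(k)
    return v

# Reproduces the v row of Table 1:
assert matching_statistics("abba", "ababc") == [2, 1, 2, 1]
```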
5 Weights and Kernels

From the previous sections we know how to determine the set of all longest prefixes $x[i : \bar{v}_i]$ of $x[i : n]$ in $y$ in linear time. The following theorem uses this information to compute kernels efficiently.

Theorem 5  Let $x$ and $y$ be strings and let $c$ and $v$ be the matching statistics of $x$ with respect to $y$. Assume that

$$W(y, t) = \sum_{s \in \mathrm{prefix}(v)} w_{us} - w_u \quad \text{where } u = \mathrm{floor}(t) \text{ and } t = uv \quad (5)$$

can be computed in constant time for any $t$. Then $k(x, y)$ can be computed in $O(|x| + |y|)$ time as

$$k(x, y) = \sum_{i=1}^{|x|} \mathrm{val}(x[i : \bar{v}_i]) = \sum_{i=1}^{|x|} \mathrm{val}(\bar{c}_i) + \mathrm{lvs}(\mathrm{ceil}(x[i : \bar{v}_i]))\, W(y, x[i : \bar{v}_i]) \quad (6)$$

where val($t$) := lvs(ceil($t$)) $\cdot$ $W(y, t)$ + val(floor($t$)) and val(root) := 0.

Proof  We first show that (6) can indeed be computed in linear time. We know that for $S(y)$ the number of leaves can be computed in linear time, and likewise $c, v$. By the assumption on $W(y, t)$, and by exploiting the recursive nature of val($t$), we can compute val($w$) for all nodes $w$ of $S(y)$ by a simple top-down procedure in $O(|y|)$ time. Also, due to the recursion, the second equality of (6) holds and we may compute each term in constant time by a simple lookup for val($\bar{c}_i$) and computation of $W(y, x[i : \bar{v}_i])$. Since there are $|x|$ terms, the whole procedure takes $O(|x|)$ time, which proves the $O(|x| + |y|)$ time complexity.

Now we prove that (6) really computes the kernel. We know from Lemma 4 that the sum in (2) can be decomposed into the sum over matches between $y$ and each of the prefixes of $x[i : \bar{v}_i]$ (this takes care of all the substrings of $x$ matching with $y$). This reduces the problem to showing that each term in the sum of (6) corresponds to the contribution of all prefixes of $x[i : \bar{v}_i]$.

Assume we descend down the path $x[i : \bar{v}_i]$ in $S(y)$ (e.g., for the string bab with respect to the tree of Figure 2 this would correspond to (root, b, bab)); then each of the prefixes $t$ along the path (e.g., ($\epsilon$, b, ba, bab) for the example tree) occurs exactly as many times as lvs(ceil($t$)). In particular, prefixes ending on the same edge occur the same number of times. This allows us to bracket the sums efficiently, and $W(y, t)$ is simply the sum along an edge, from floor($t$) up to $t$. Unwrapping val($t$) shows that this is simply the sum over the occurrences along the path to $t$, which proves our claim. ∎

So far our claim hinges on the fact that $W(y, t)$ can be computed in constant time, which is far from obvious at first glance. We now show that this is a reasonable assumption in all practical cases.

Length-Dependent Weights  If the weights $w_s$ depend only on $|s|$ we have $w_s = w_{|s|}$. Define $W_j := \sum_{i=1}^{j} w_i$ and compute its values beforehand up to $W_J$, where $J \ge |x|$ for all $x$. Then it follows that

$$W(y, t) = \sum_{j = |\mathrm{floor}(t)| + 1}^{|t|} w_j = W_{|t|} - W_{|\mathrm{floor}(t)|} \quad (7)$$

which can be computed in constant time. Examples of such weighting schemes are the kernels suggested by [15], where $w_i = \lambda^{-i}$, by [7], where $w_i = 1$, and by [10], where $w_i = \delta_{i1}$.
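The constant-time evaluation of (7) is just a precomputed prefix sum over the length weights. A small sketch under the $\lambda^{|s|}$ weighting used later in Section 6; the helper names are ours:

```python
def cumulative_weights(w, J):
    """W[j] = w_1 + ... + w_j, precomputed once up to J >= |x|."""
    W = [0.0] * (J + 1)
    for j in range(1, J + 1):
        W[j] = W[j - 1] + w(j)
    return W

lam = 0.75
W = cumulative_weights(lambda j: lam ** j, J=1000)

def W_y(t_len, floor_len):
    """W(y, t) of eq. (7): the total weight of the prefixes ending strictly
    between floor(t) and t, obtained by two table lookups."""
    return W[t_len] - W[floor_len]

# sanity check against the direct sum
assert abs(W_y(7, 3) - sum(lam ** j for j in range(4, 8))) < 1e-12
```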
Generic Weights  In the case of generic weights we have several options. Recall that one will often want to compute the $m^2$ kernel values $k(x, x')$ between $m$ strings $x \in X$. Hence we could build the suffix trees for the $x_i$ beforehand and annotate each of the nodes and characters on the edges explicitly (at super-linear cost per string), which means that later, for the dot products, we only need to perform table lookups of $W(x, x'[i : \bar{v}_i])$.

However, there is an even more efficient mechanism, which can also deal with dynamic weights that depend on the relative frequency of occurrence of the substrings in all of $X$. We can build a suffix tree $\Sigma$ of all strings in $X$. Again, this can be done in time linear in the total length of the strings (simply consider the concatenation of all strings). It can be shown that for all $x$ and all $i$, $x[i : \bar{v}_i]$ will be a node in this tree. Leaf counting allows us to compute these dynamic weights efficiently, since $\Sigma$ contains all the substrings.

For $W(x, x'[i : \bar{v}_i])$ we make the simplifying assumption that $w_s = \phi(|s|) \cdot \phi(\mathrm{freq}(s))$, that is, $w_s$ depends on length and frequency only. Now note that all the strings ending on the same edge of $\Sigma$ will have the same weights assigned to them. Hence we can rewrite (5) as

$$W(y, t) = \sum_{s \in \mathrm{prefix}(t)} w_s - \sum_{s \in \mathrm{prefix}(\mathrm{floor}(t))} w_s = \phi(\mathrm{freq}(t)) \sum_{i = |\mathrm{floor}(t)| + 1}^{|t|} \phi(i) \quad (8)$$

By precomputing $\sum_i \phi(i)$ we can evaluate (8) in constant time. The benefit of (8) is twofold: we can compute the weights of all the nodes of $\Sigma$ in time linear in the total length of the strings in $X$, and, for arbitrary $x$, we can compute $W(y, t)$ in constant time, thus allowing us to compute $k(x_i, x')$ in $O(|x_i| + |x'|)$ time.

Linear Time Prediction  Let $X_s = \{x_1, x_2, \ldots, x_m\}$ be the set of support vectors. Recall that for prediction with a Support Vector Machine we need to compute $f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x)$, which implies that we need to combine the contributions of matching substrings from each one of the support vectors. We first construct $S(X_s)$ in linear time by using the algorithm of [1]. In $S(X_s)$ we associate the weight $\alpha_i$ with each leaf belonging to the support vector $x_i$, and for a node $v \in$ nodes($S(X_s)$) we modify the definition of lvs($v$) to be the sum of the weights associated with the leaves of the subtree rooted at $v$. A straightforward application of the matching statistics algorithm of [2] shows that we can find the matching statistics of $x$ with respect to all strings in $X_s$ in $O(|x|)$ time. Now Theorem 5 can be applied unchanged to compute $f(x)$. A detailed account and proof can be found in [14]. In summary, we can classify texts in linear time regardless of the size of the training set. This makes SVMs for large-scale text categorization practically feasible. Similar modifications can also be applied for training SMO-like algorithms on strings.
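The effect of merging the support vectors into one $\alpha$-annotated structure can be imitated, at quadratic preprocessing cost, by pooling the weighted substring counts into a single table; scoring a new string then takes one pass over its substrings, independent of $m$. A toy sketch reusing `substring_counts` from Section 1, with all names ours:

```python
from collections import Counter

def support_profile(svs, alphas, w=lambda s: 1.0):
    """Pool alpha- and w-weighted substring counts of all support vectors;
    this plays the role of S(X_s) with its reweighted lvs values."""
    profile = Counter()
    for x_i, a_i in zip(svs, alphas):
        for s, n in substring_counts(x_i).items():
            profile[s] += a_i * w(s) * n
    return profile

def decision_function(profile, x):
    """f(x) = sum_i alpha_i k(x_i, x), evaluated without ever touching
    the individual support vectors again."""
    return sum(n * profile[s] for s, n in substring_counts(x).items()
               if s in profile)
```

This works because $f(x) = \sum_s \mathrm{num}_s(x) \sum_i \alpha_i w_s \,\mathrm{num}_s(x_i)$: the inner sum can be aggregated once, which is exactly what the merged suffix tree does with linear-size structures.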
6 Experimental Results

For a proof of concept we tested our approach on a remote homology detection problem¹ [9], using Stafford Noble's SVM package² as the training algorithm. A length-weighted kernel was used, and we assigned weights $w_s = \lambda^{|s|}$ to all substring matches of length greater than 3, regardless of triplet boundaries. To evaluate performance we computed ROC50 scores.³

Being a proof of concept, we did not try to tune the soft margin SVM parameters (the main point of the paper being the introduction of a novel means of evaluating string kernels efficiently rather than applications; a separate paper focusing on applications is in preparation).

Table 3 contains the ROC50 scores for the spectrum kernel with $k = 3$ [12] and our string kernel with $\lambda = 0.75$. We tested with $\lambda \in \{0.25, 0.5, 0.75, 0.9\}$ and report the best results here. As can be seen, our kernel outperforms the spectrum kernel on nearly every family in the dataset.

[Figure 3: Total number of families for which an SVM classifier exceeds a ROC50 score threshold; the curves compare our kernel (λ = 0.75) with the spectrum kernel.]

It should be noted that this is the first method that allows users to specify weights rather arbitrarily for all possible lengths of matching sequences and still compute kernels in $O(|x| + |x'|)$ time, plus predict on new sequences in $O(|x|)$ time once the set of support vectors is established.⁴

¹ Details and data available at www.cse.ucsc.edu/research/compbio/discriminative.
² Available at www.cs.columbia.edu/compbio/svm.
³ The ROC50 score [6, 12] is the area under the receiver operating characteristic curve (the plot of true positives as a function of false positives) up to the first 50 false positives. A score of 1 indicates perfect separation of positives from negatives, whereas a score of 0 indicates that none of the top 50 sequences selected by the algorithm were positives.
⁴ [12] obtain an $O(k|x|)$ algorithm in the (somewhat more restrictive) case of $w_s = \delta_k(|s|)$.
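For completeness, here is a minimal ROC50 scorer following footnote 3 above. Conventions differ in corner cases (tied scores, fewer than 50 negatives), so this is one plausible reading rather than the evaluation code used in the experiments:

```python
def roc50_score(scores, labels):
    """Area under the ROC curve up to the first 50 false positives,
    normalized so that perfect separation scores 1 (assumes at least
    50 negatives and no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for lab in labels if lab)
    tp, fp, area = 0, 0, 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp        # each false positive adds a column of height tp
            if fp == 50:
                break
    return area / (50 * n_pos) if n_pos else 0.0
```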
7 Conclusion

We have shown that string kernels need not come at super-linear cost in SVMs and that prediction can be carried out at cost linear in the length of the argument, thus providing optimal run-time behaviour. Furthermore, the same algorithm can be applied to trees.

The methodology pointed out in our paper has several immediate extensions: for instance, we may consider coarsening levels for trees by removing some of the leaves. For not too unbalanced trees (we assume that the tree shrinks at least by a constant factor at each coarsening), computation of the kernel over all coarsening levels can then be carried out at cost still linear in the overall size of the tree. The idea of coarsening can be extended to approximate string matching: if we remove characters, this amounts to the use of wildcards. Likewise, we can consider the strings generated by finite state machines and thereby compare the finite state machines themselves. This leads to kernels on automata and other dynamical systems. More details and extensions can be found in [14].

Acknowledgments  We would like to thank Patrick Haffner, Daniela Pucci de Farias, and Bob Williamson for comments and suggestions. This research was supported by a grant of the Australian Research Council. SVNV thanks Trivium India Software and Netscaler Inc. for their support.

References

[1] A. Amir, M. Farach, Z. Galil, R. Giancarlo, and K. Park. Dynamic dictionary matching. Journal of Computer and System Sciences, 49(2):208-222, October 1994.

[2] W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4/5):327-344, 1994.

[3] M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2001. MIT Press.

[4] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.

[5] R. Giegerich and S. Kurtz. From Ukkonen to McCreight and Weiner: A unifying view of linear-time suffix tree construction. Algorithmica, 19(3):331-353, 1997.

[6] M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25-33, 1996.

[7] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, UC Santa Cruz, 1999.

[8] R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2002.

[9] T. S. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7:95-114, 2000.

[10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 169-184, Cambridge, MA, 1999. MIT Press.

[11] E. Leopold and J. Kindermann. Text categorization with support vector machines: How to represent text in input space? Machine Learning, 46(3):423-444, March 2002.

[12] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, pages 564-575, 2002.

[13] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249-260, 1995.

[14] S. V. N. Vishwanathan. Kernel Methods: Fast Algorithms and Real Life Applications. PhD thesis, Indian Institute of Science, Bangalore, India, November 2002.

[15] C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39-50, Cambridge, MA, 2000. MIT Press.