{"title": "Scalable Algorithms for String Kernels with Inexact Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 881, "page_last": 888, "abstract": "We present a new family of linear time algorithms based on sufficient statistics for string comparison with mismatches under the string kernels framework. Our algorithms improve theoretical complexity bounds of existing approaches while scaling well with respect to the sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets with loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times.", "full_text": "Scalable Algorithms for String Kernels with Inexact\n\nMatching\n\nPavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic\n\nDepartment of Computer Science,\n\nRutgers University, Piscataway, NJ 08854\n\n{pkuksa,paihuang,vladimir}@cs.rutgers.edu\n\nAbstract\n\nWe present a new family of linear time algorithms for string comparison with\nmismatches under the string kernels framework. 
Based on suf\ufb01cient statistics, our\nalgorithms improve theoretical complexity bounds of existing approaches while\nscaling well in sequence alphabet size, the number of allowed mismatches and\nthe size of the dataset.\nIn particular, on large alphabets and under loose mis-\nmatch constraints our algorithms are several orders of magnitude faster than the\nexisting algorithms for string comparison under the mismatch similarity measure.\nWe evaluate our algorithms on synthetic data and real applications in music genre\nclassi\ufb01cation, protein remote homology detection and protein fold prediction. The\nscalability of the algorithms allows us to consider complex sequence transforma-\ntions, modeled using longer string features and larger numbers of mismatches,\nleading to a state-of-the-art performance with signi\ufb01cantly reduced running times.\n\n1 Introduction\n\nAnalysis of large scale sequential data has become an important task in machine learning and data\nmining, inspired by applications such as biological sequence analysis, text and audio mining. Clas-\nsi\ufb01cation of string data, sequences of discrete symbols, has attracted particular interest and has led to\na number of new algorithms [1, 2, 3, 4]. They exhibit state-of-the-art performance on tasks such as\nprotein superfamily and fold prediction, music genre classi\ufb01cation and document topic elucidation.\n\nClassi\ufb01cation of data in sequential domains is made challenging by the variability in the sequence\nlengths, potential existence of important features on multiple scales, as well as the size of the al-\nphabets and datasets. Typical alphabet sizes can vary widely, ranging in size from 4 nucleotides\nin DNA sequences, up to thousands of words from a language lexicon for text documents. Strings\nwithin the same class, such as the proteins in one fold or documents about politics, can exhibit wide\nvariability in the primary sequence content. 
Moreover, important datasets continue to increase in size, easily reaching millions of sequences. As a consequence, the resulting algorithms need the ability to efficiently handle large alphabets and datasets, as well as to establish measures of similarity under complex sequence transformations, in order to accurately classify the data.

A number of state-of-the-art approaches to scoring similarity between pairs of sequences in a database rely on fixed, spectral representations of sequential data and the notion of mismatch kernels, cf. [2, 3]. In that framework the induced representation of a sequence is typically that of the spectra (counts) of all short substrings (k-mers) contained within the sequence. The similarity score is established by allowing transformations of the original k-mers based on different models of deletions, insertions and mutations. However, computing those representations efficiently for large alphabet sizes and "loose" similarity models can be computationally challenging. For instance, the complexity of an efficient trie-based computation [3, 5] of the mismatch kernel between two strings X and Y strongly depends on the alphabet size and the number of mismatches allowed: it is O(k^{m+1} |Σ|^m (|X| + |Y|)) for k-mers (contiguous substrings of length k) with up to m mismatches and alphabet size |Σ|. This limits the applicability of such algorithms to simpler transformation models (smaller k and m) and smaller alphabets, reducing their practical utility on complex real data. As an alternative, more complex transformation models such as [2] lead to state-of-the-art predictive performance at the expense of increased computational effort.

In this work we propose novel algorithms for modeling sequences under complex transformations (such as multiple insertions, deletions, and mutations) that exhibit state-of-the-art performance on a variety of distinct classification tasks. 
In particular, we present new algorithms for inexact (e.g. with mismatches) string comparison that improve currently known time bounds for such tasks and show orders-of-magnitude running time improvements. The algorithms rely on an efficient implicit computation of mismatch neighborhoods and k-mer statistics on sets of sequences. This leads to a mismatch kernel algorithm with complexity O(c_{k,m}(|X| + |Y|)), where c_{k,m} is independent of the alphabet Σ. The algorithm can be easily generalized to other families of string kernels, such as the spectrum and gapped kernels [6], as well as to semi-supervised settings such as the neighborhood kernel of [7]. We demonstrate the benefits of our algorithms on many challenging classification problems, such as detecting homology (evolutionary similarity) of remotely related proteins, recognizing protein folds, and performing classification of music samples. The algorithms display state-of-the-art classification performance and run substantially faster than existing methods. The low computational complexity of our algorithms opens the possibility of analyzing very large datasets under both fully-supervised and semi-supervised settings with modest computational resources.

2 Related Work

Over the past decade, various methods were proposed to solve the string classification problem, including generative approaches, such as HMMs, and discriminative approaches. Among the discriminative approaches, in many sequence analysis tasks, kernel-based [8] machine learning methods provide the most accurate results [2, 3, 4, 6].

Sequence matching is frequently based on the common co-occurrence of exact sub-patterns (k-mers, features), as in spectrum kernels [9] or substring kernels [10]. Inexact comparison in this framework is typically achieved using different families of mismatch [3] or profile [2] kernels. 
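For concreteness, the exact-matching spectrum representation (the m = 0 case of the mismatch kernel) can be sketched in a few lines of Python. This is a minimal illustration of the k-mer counting idea, not the authors' implementation; the function names are ours:

```python
from collections import Counter

def spectrum(seq, k):
    """Count all contiguous k-mers (the spectrum) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k):
    """K(X, Y) = dot product of the two k-mer count vectors."""
    sx, sy = spectrum(x, k), spectrum(y, k)
    # Only k-mers occurring in both sequences contribute to the sum.
    return sum(cnt * sy[kmer] for kmer, cnt in sx.items())

print(spectrum_kernel("abracadabra", "cadabra", 3))  # -> 7
```

The mismatch kernels discussed next generalize this by also counting k-mer pairs that agree up to m substitutions.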
Both the spectrum-k and the mismatch(k,m) kernel directly extract string features from the observed sequence, X. On the other hand, the profile kernel, proposed by Kuang et al. in [2], builds a profile [11] P_X and uses a similar |Σ|^k-dimensional representation, derived from P_X. Constructing the profile for each sequence may not be practical in some application domains, since the size of the profile depends on the size of the alphabet set. While for bio-sequences |Σ| = 4 or 20, for music or text classification |Σ| can potentially be very large, on the order of tens of thousands of symbols. In this case, a very simple semi-supervised learning method, the sequence neighborhood kernel, can be employed [7] as an alternative to lone k-mers with many mismatches.

The most efficient available trie-based algorithms [3, 5] for mismatch kernels have a strong dependency on the size of the alphabet set and the number of allowed mismatches, both of which need to be restricted in practice to control the complexity of the algorithm. Under the trie-based framework, the list of k-mers extracted from the given strings is traversed in a depth-first search with branches corresponding to all possible σ ∈ Σ. Each leaf node at depth k corresponds to a particular k-mer feature (either an exact or an inexact instance of the observed exact string features) and contains a list of matching features from each string. The kernel matrix is updated at leaf nodes with the corresponding counts. The complexity of the trie-based algorithm for mismatch kernel computation between two strings X and Y is O(k^{m+1} |Σ|^m (|X| + |Y|)) [3]. The algorithm complexity depends on the size of Σ since, during a trie traversal, possible substitutions are drawn from Σ explicitly; consequently, to control the complexity of the algorithm we need to restrict the number of allowed mismatches (m), as well as the alphabet size (|Σ|). 
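The |Σ|^m factor in this bound is easy to appreciate numerically: the (k,m)-mismatch neighborhood of a single k-mer contains Σ_{i=0}^{m} (k choose i)(|Σ| − 1)^i elements (the closed form used in Section 3.3), so the work a trie traversal does per k-mer grows rapidly with the alphabet. A quick sketch, ours and for illustration only:

```python
from math import comb

def neighborhood_size(k, m, sigma):
    """Number of k-mers within Hamming distance m of a fixed k-mer:
    choose i of the k positions, then (sigma - 1) substitutions each."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(m + 1))

# Neighborhood a trie-based search effectively explores per k-mer (k=5, m=2),
# for alphabet sizes matching the datasets in the paper (DNA, protein, music):
for sigma in (4, 20, 1024):
    print(sigma, neighborhood_size(5, 2, sigma))
```

For k = 5, m = 2 this gives 106 neighbors for DNA (|Σ| = 4) but over ten million for the music alphabet (|Σ| = 1024), which is exactly the blow-up the algorithms below avoid.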
Such limitations hinder wide application of this powerful computational tool: in biological sequence analysis, mutations, insertions and deletions frequently co-occur, establishing the need to relax the parameter m; on the other hand, restricting the size of the alphabet set strongly limits applications of the mismatch model. While other efficient string algorithms exist, such as [6, 12] and the suffix-tree based algorithms in [10], they do not readily extend to the mismatch framework. In this study, we aim to extend the works presented in [6, 10] and close the existing gap in theoretical complexity between the mismatch and other fast string kernels.

3 Combinatorial Algorithm

In this section we develop our first improved algorithm for kernel computations with mismatches, which serves as a starting point for our main algorithm in Section 4.

3.1 Spectrum and Mismatch Kernels Definition

Given a sequence X with symbols from alphabet Σ, the spectrum-k kernel [9] and the mismatch(k,m) kernel [3] induce the following |Σ|^k-dimensional representation for the sequence:

Φ(X) = ( Σ_{α∈X} I_m(α, γ) )_{γ∈Σ^k},   (1)

where I_m(α, γ) = 1 if α ∈ N_{k,m}(γ), and N_{k,m}(γ) is the mutational neighborhood, the set of all k-mers that differ from γ by at most m mismatches. Note that, by definition, for spectrum kernels m = 0.

The mismatch kernel is then defined as

K(X, Y |k, m) = Σ_{γ∈Σ^k} c_m(γ|X) c_m(γ|Y),   (2)

where c_m(γ|X) = Σ_{α∈X} I_m(γ, α) is the number of times a contiguous substring of length k (k-mer) γ occurs in X with no more than m mismatches.

3.2 Intersection-based Algorithm

Our first algorithm presents a novel way of performing local inexact string matching with the following key properties:

a. parameter independent: the complexity is independent of |Σ| and the mismatch parameter m
b. in-place: only uses min(2m, k) + 1 extra space for an auxiliary look-up table
c. linear complexity in k, the length of the substring (as opposed to the exponential k^m factor)

To develop our first algorithm, we first write the mismatch kernel (Equation 2) in an equivalent form:

K(X, Y |k, m) = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} Σ_{a∈Σ^k} I_m(a, x_{ix:ix+k−1}) I_m(a, y_{iy:iy+k−1})   (3)
             = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} |N(x_{ix:ix+k−1}, m) ∩ N(y_{iy:iy+k−1}, m)|   (4)
             = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} I(x_{ix:ix+k−1}, y_{iy:iy+k−1}),   (5)

where I(a, b) is the number of induced (neighboring) k-mers common to a and b (i.e. I(a, b) is the size of the intersection of the mismatch neighborhoods of a and b). The key observation here is that if we can compute I(a, b) efficiently, then the kernel evaluation problem reduces to performing pairwise comparisons of all pairs of observed k-mers, a and b, in the two sequences. The complexity of such a procedure is O(c|X||Y|), where c is the cost of evaluating I(a, b) for any given k-mers a and b. In fact, for fixed k, m and Σ, this quantity depends only on the Hamming distance d(a, b) (i.e. the number of mismatches) and can be evaluated in advance, as we will show in Section 3.3. As a result, the intersection values can be looked up in a table in constant time during matching. Note that the summation now shows no explicit dependency on |Σ| and m. In summary, given two strings X and Y, the algorithm (Algorithm 1) compares pairs of observed k-mers from X and Y and computes the mismatch kernel according to Equation 5.

Algorithm 1. 
(Hamming-Mismatch) Mismatch algorithm based on Hamming distance
Input: strings X, Y, |X| = nx, |Y| = ny, parameters k, m, lookup table I for intersection sizes
Evaluate kernel using Equation 5: K(X, Y |k, m) = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} I(d(x_{ix:ix+k−1}, y_{iy:iy+k−1}) |k, m), where I(d) is the intersection size for distance d
Output: Mismatch kernel value K(X, Y |k, m)

The overall complexity of the algorithm is O(k nx ny), since the Hamming distances between all k-mer pairs observed in X and Y need to be known. In the following section, we discuss how to efficiently compute the size of the intersection.

3.3 Intersection Size: Closed Form Solution

The number of neighboring k-mers shared by two observed k-mers a and b can be directly computed, in closed form, from the Hamming distance d(a, b) for fixed k and m, requiring no explicit traversal of the k-mer space as in the case of trie-based computations. We first consider the case a = b (i.e. d(a, b) = 0). The intersection size then equals the size of the (k, m)-mismatch neighborhood, i.e. I(a, b) = |N_{k,m}| = Σ_{i=0}^{m} (k choose i)(|Σ| − 1)^i. For higher values of the Hamming distance d, the key observation is that for fixed Σ, k, and m, given any distance d(a, b) = d, I(a, b) is also a constant, regardless of the mismatch positions. As a result, intersection values can always be pre-computed once, stored, and looked up when necessary. To illustrate this, we show two examples for m = 1, 2:

I(a, b) (m = 1) =
  |N_{k,1}|, d(a, b) = 0
  |Σ|, d(a, b) = 1
  2, d(a, b) = 2

I(a, b) (m = 2) =
  |N_{k,2}|, d(a, b) = 0
  1 + k(|Σ| − 1) + (k − 1)(|Σ| − 1)^2, d(a, b) = 1
  1 + 2(k − 1)(|Σ| − 1) + (|Σ| − 1)^2, d(a, b) = 2
  6(|Σ| − 1), d(a, b) = 3
  (4 choose 2), d(a, b) = 4

In general, the intersection size can be written in the weighted form Σ_i w_i (|Σ| − 1)^i and can be pre-computed in constant time.

4 Mismatch Algorithm based on Sufficient Statistics

In this section, we further develop the ideas from the previous section and present an improved mismatch algorithm that does not require pairwise comparison of the k-mers between two strings and depends linearly on the sequence length. The crucial observation is that in Equation 5, I(a, b) is non-zero only when d(a, b) ≤ 2m. As a result, the kernel computed in Equation 5 is incremented by only min(2m, k) + 1 distinct values, corresponding to the min(2m, k) + 1 possible intersection sizes. We can then re-write the equation in the following form:

K(X, Y |m, k) = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} I(x_{ix:ix+k−1}, y_{iy:iy+k−1}) = Σ_{i=0}^{min(2m,k)} M_i I_i,   (6)

where I_i is the size of the intersection of k-mer mutational neighborhoods at Hamming distance i, and M_i is the number of observed k-mer pairs in X and Y at Hamming distance i. The problem of computing the kernel has thus been reduced to a single summation. We have shown in Section 3.3 that, for any i, we can compute I_i in advance. The crucial task now becomes computing the sufficient statistics M_i efficiently. 
In the following, we will show how to compute the mismatch statistics {M_i} in O(c_{k,m}(nx + ny)) time, where c_{k,m} is a constant that does not depend on the alphabet size. We formulate the task of inferring the matching statistics {M_i} as the following auxiliary counting problem:

Mismatch Statistic Counting: Given a set of n k-mers from two strings X and Y, for each Hamming distance i = 0, 1, ..., min(2m, k), output the number of k-mer pairs (a, b), a ∈ X, b ∈ Y, with d(a, b) = i.

In this problem it is not necessary to know the distance between each pair of k-mers; one only needs to know the number of pairs (M_i) at each distance i. We show next that the above problem of computing the matching statistics can be solved in linear time (in the number n of k-mers) using multiple rounds of counting sort as a sub-algorithm.

We first consider the problem of computing the number of k-mer pairs at distance 0, i.e. the number of exact matches. In this case, we can apply counting sort to order all k-mers lexicographically and find the number of exact matches by scanning the sorted list. The counting then requires linear O(kn) time. Efficient direct computation of M_i for any i > 0 is difficult (it requires quadratic time); we therefore take another approach and first compute inexact cumulative mismatch statistics, C_i = M_i + Σ_{j=0}^{i−1} ((k−j) choose (i−j)) M_j, that overcount the number of k-mer pairs at a given distance i, as follows. Consider two k-mers a and b. Pick i positions and remove from both k-mers the symbols at the corresponding positions to obtain (k−i)-mers a′ and b′. The key observation is that d(a′, b′) = 0 ⇒ d(a, b) ≤ i. As a result, given n k-mers, we can compute the cumulative mismatch statistics C_i in linear time using (k choose i) rounds of counting sort on (k−i)-mers. The exact mismatch statistics M_i can then be obtained from C_i by subtracting the exact counts to compensate for the overcounting:

M_i = C_i − Σ_{j=0}^{i−1} ((k−j) choose (i−j)) M_j,   i = 0, ..., min(min(2m, k), k − 1).   (7)

The last mismatch statistic M_k can be computed by subtracting the preceding statistics M_0, ..., M_{k−1} from the total number of possible matches:

M_k = T − Σ_{j=0}^{k−1} M_j, where T = (nx − k + 1)(ny − k + 1).   (8)

Our algorithm for mismatch kernel computation based on sufficient statistics is summarized in Algorithm 2. The overall complexity of the algorithm is O(n c_{k,m}) with the constant c_{k,m} = Σ_{l=0}^{min(2m,k)} (k choose l)(k − l), independent of the size of the alphabet set, where (k choose l) is the number of rounds of counting sort for evaluating the cumulative mismatch statistic C_l.

Algorithm 2. (Mismatch-SS) Mismatch kernel algorithm based on Sufficient Statistics
Input: strings X, Y, |X| = nx, |Y| = ny, parameters k, m, pre-computed intersection values I
1. Compute min(2m, k) cumulative matching statistics, C_i, using counting sort
2. Compute exact matching statistics, M_i, using Equation 7
3. Evaluate kernel using Equation 6: K(X, Y |m, k) = Σ_{i=0}^{min(2m,k)} M_i I_i
Output: Mismatch kernel value K(X, Y |k, m)

5 Extensions

Our algorithmic approach can also be applied to a variety of existing string kernels, leading to very efficient and simple algorithms that could benefit many applications.

Spectrum Kernels. The spectrum kernel [9] in our notation is the first sufficient statistic M_0, i.e. K(X, Y |k) = M_0, which can be computed in k rounds of counting sort (i.e. in O(kn) time).

Gapped Kernels. 
The gapped kernels [6] measure similarity between strings X and Y based on the co-occurrence of gapped instances g, |g| = k + m > k, of k-long substrings:

K(X, Y |k, g) = Σ_{γ∈Σ^k} ( Σ_{g∈X, |g|=k+m} I(γ, g) ) ( Σ_{g∈Y, |g|=k+m} I(γ, g) ),   (9)

where I(γ, g) = 1 when γ is a subsequence of g. Similar to the algorithmic approach for extracting cumulative mismatch statistics in Algorithm 2, to compute the gapped(g,k) kernel we perform a single round of counting sort over the k-mers contained in the g-mers. This gives a very simple and efficient O((g choose k) kn) time algorithm for gapped kernel computations.

Wildcard kernels. The wildcard(k,m) kernel [6] in our notation is the sum of the cumulative statistics, K(X, Y |k, m) = Σ_{i=0}^{m} C_i, i.e. it can be computed in Σ_{i=0}^{m} (k choose i) rounds of counting sort, giving a simple and efficient O(Σ_{i=0}^{m} (k choose i)(k − i) n) algorithm.

Spatial kernels. The spatial(k,t,d) kernel [13] can be computed by sorting kt-mers iteratively for every arrangement of t k-mers spatially constrained by the distance d.

Neighborhood Kernels. The sequence neighborhood kernels [7] proved to be a powerful tool in many sequence analysis tasks. The method uses unlabeled data to form a set of neighbors for the train/test sequences and measures the similarity of two sequences X and Y using their neighborhoods:

K(X, Y) = Σ_{x∈N(X)} Σ_{y∈N(Y)} K(x, y),   (10)

where N(X) is the sequence neighborhood that contains neighboring sequences from the unlabeled data set, including X itself. Note that the kernel value, if computed directly using Equation 10, will incur quadratic complexity in the size of the neighborhoods. 
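The pooling trick that avoids this quadratic cost follows from bilinearity: when the base kernel is a dot product of k-mer count vectors, summing kernel values over all neighbor pairs equals one dot product of the pooled count vectors. A minimal sketch using the spectrum (m = 0) base kernel; helper names are ours and this is an illustration, not the authors' implementation:

```python
from collections import Counter

def spectrum(seq, k):
    """k-mer counts (spectrum) of one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def neighborhood_kernel_direct(NX, NY, k):
    """Equation 10 evaluated naively: |N(X)| * |N(Y)| base-kernel calls."""
    def base(x, y):
        sx, sy = spectrum(x, k), spectrum(y, k)
        return sum(c * sy[g] for g, c in sx.items())
    return sum(base(x, y) for x in NX for y in NY)

def neighborhood_kernel_pooled(NX, NY, k):
    """Same value in one pass: pool each neighborhood's k-mer counts,
    then take a single dot product (linear in total neighborhood size)."""
    px, py = Counter(), Counter()
    for x in NX:
        px.update(spectrum(x, k))
    for y in NY:
        py.update(spectrum(y, k))
    return sum(c * py[g] for g, c in px.items())

NX, NY = ["abcabc", "abcxbc"], ["bcab", "xbca"]
assert neighborhood_kernel_direct(NX, NY, 2) == neighborhood_kernel_pooled(NX, NY, 2)
```

The joint-sorting scheme described next generalizes this pooling idea to the mismatch and gapped cases.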
Similar to the single-string case, using our algorithmic approach we can compute the neighborhood kernel (over the string sets) by jointly sorting the observed k-mers in N(X) and N(Y) and applying the desired kernel evaluation method (spectrum, mismatch, or gapped). Under this setting, the neighborhood kernel can be evaluated in time linear in the neighborhood size. This leads to very efficient algorithms for computing sequence neighborhood kernels even for very large datasets, as we will show in the experimental section.

6 Evaluation

We study the performance of our algorithms, both in running time and predictive accuracy, on synthetic data and standard benchmark datasets for protein sequence analysis and music genre classification. The reduced running time requirements of our algorithms open the possibility of considering "looser" mismatch measures with larger k and m. The results presented here demonstrate that such mismatch kernels with larger (k, m) can lead to state-of-the-art predictive performance even when compared with more complex models such as [2].

We use three standard benchmark datasets to compare with previously published results: the SCOP dataset (7329 sequences with 2862 labeled) [7] for remote protein homology detection, the Ding-Dubchak dataset1 (27 folds, 694 sequences) [14, 15] for protein fold recognition, and music genre data2 (10 classes, 1000 sequences, |Σ| = 1024) [16] for multi-class genre prediction. For protein sequence classification under the semi-supervised setting, we also use the Protein Data Bank (PDB, 17,232 sequences), the Swiss-Prot (101,602 sequences), and the non-redundant (NR) databases as the unlabeled datasets, following the setup of [17]. All experiments are performed on a single 2.8GHz CPU. 
The datasets used in our experiments and the supplementary data/code are available at http://seqam.rutgers.edu/new-inexact/new-inexact.html.

6.1 Running time analysis

We compare the running time of our algorithm on synthetic and real data with the trie-based computations. For synthetic data, we generate strings of length n = 10^5 over alphabets of different sizes and measure the running time of the trie-based and our sufficient-statistics based algorithms for evaluating the mismatch string kernel. Figure 1 shows the relative running time Ttrie/Tss, in logarithmic scale, of mismatch-trie and mismatch-SS as a function of the alphabet size. As can be seen from the plot, our algorithm demonstrates several orders of magnitude improvement, especially for large alphabet sizes.

Table 1 compares the running times of our algorithm and the trie-based algorithm on different real datasets (proteins, DNA, text, music) for a single kernel entry (pair of strings) computation. We observe speed improvements ranging from 100 to 10^6 times depending on the alphabet size.

We also measure the running time for the full 7329-by-7329 mismatch(5,2) kernel matrix computation for the SCOP dataset under the supervised setting. The running time of our algorithm is 1525 seconds, compared to 196052 seconds for the trie-based computation. The obtained speed-up of 128 times is as expected from the theoretical analysis (our algorithm performs 31 counting-sort iterations in total over 5-, 4-, 3-, 2-, and 1-mers, which gives a running time ratio of approximately 125 when compared to the trie-based complexity). 
We observe similar improvements under the semi-supervised setting for neighborhood mismatch kernels; for example, computing a smaller neighborhood mismatch(5,2) kernel matrix for the labeled sequences only (2862-by-2862 matrix) using the Swiss-Prot unlabeled dataset takes 1,480 seconds with our algorithm, whereas performing the same task with the trie-based algorithm takes about 5 days.

1 http://ranger.uta.edu/~chqding/bioinfo.html
2 http://opihi.cs.uvic.ca/sound/genres

Figure 1: Relative running time Ttrie/Tss (in logarithmic scale) of mismatch-trie and mismatch-SS as a function of the alphabet size (mismatch(5,1) kernel, n = 10^5). [plot omitted]

Table 1: Running time (in seconds) for kernel computation between two strings on real data

             protein   long protein   dna      text     music
n            116       36672          570      242      6892
|Σ|          20        20             4        29224    1024
(5,1)-trie   0.0212    1.6268         0.0260   20398    526.8
(5,1)-ss     0.0052    0.1987         0.0054   0.0178   0.0331
time ratio   4         8              5        10^6     16,000
(5,2)-trie   0.2918    31.5519        0.4800   -        -
(5,2)-ss     0.0067    0.2957         0.0064   0.0941   0.0649
time ratio   44        100            75       -        -

6.2 Empirical performance analysis

In this section we show predictive performance results for several sequence analysis tasks using our new algorithms. We consider the tasks of multi-class music genre classification [16], with results in Table 2, and protein remote homology (superfamily) prediction [9, 2, 18] in Table 3. 
We also\ninclude preliminary results for multi-class fold prediction [14, 15] in Table 4.\n\nOn the music classi\ufb01cation task, we observe signi\ufb01cant improvements in accuracy for larger number\nof mismatches. The obtained error rate (35.6%) on this dataset compares well with the state-of-the-\nart results based on the same signal representation in [16]. The remote protein homology detection,\nas evident from Table 3, clearly bene\ufb01ts from larger number of allowed mismatches because the\nremotely related proteins are likely to be separated by multiple mutations or insertions/deletions.\nFor example, we observe improvement in the average ROC-50 score from 41.92 to 52.00 under a\nfully-supervised setting, and similar signi\ufb01cant improvements in the semi-supervised settings. In\nparticular, the result on the Swiss-Prot dataset for the (7, 3)-mismatch kernel is very promising and\ncompares well with the best results of the state-of-the-art, but computationally more demanding,\npro\ufb01le kernels [2]. The neighborhood kernels proposed by Weston et al. have already shown very\npromising results in [7], though slightly worse than the pro\ufb01le kernel. However, using our new\nalgorithm that signi\ufb01cantly improves the speed of the neighborhood kernels, we show that with\nlarger number of allowed mismatches the neighborhood can perform even better than the state-\nof-the-art pro\ufb01le kernel: the (7,3)-mismatch neighborhood achieves the average ROC-50 score of\n86.32, compared to 84.00 of the pro\ufb01le kernel on the Swiss-Prot dataset. 
This is an important result that addresses a main drawback of the neighborhood kernels, the running time [7, 2].

Table 2: Classification performance on music genre classification (multi-class)

Method           Error
Mismatch (5,1)   42.6±6.34
Mismatch (5,2)   35.6±4.99

Table 3: Classification performance on protein remote homology prediction

                     mismatch (5,1)   mismatch (5,2)   mismatch (7,3)
dataset              ROC    ROC50     ROC    ROC50     ROC    ROC50
SCOP (supervised)    87.75  41.92     90.67  49.09     91.31  52.00
SCOP (unlabeled)     90.93  67.20     91.42  69.35     92.27  73.29
SCOP (PDB)           97.06  80.39     97.24  81.35     97.93  84.56
SCOP (Swiss-Prot)    96.73  81.05     97.05  82.25     97.78  86.32

For multi-class protein fold recognition (Table 4), we similarly observe improvements in performance for larger numbers of allowed mismatches. The balanced error of 25% for the (7,3)-mismatch neighborhood kernel using Swiss-Prot compares well with the best error rate of 26.5% for the state-of-the-art profile kernel with adaptive codes in [15], which used a much larger non-redundant (NR) dataset. Using NR, the balanced error of the (7,3)-mismatch further reduces to 22.5%.

Table 4: Classification performance on fold prediction (multi-class)

Method            Error   Top5 Error   Bal. Error   Top5 Bal. Error   Recall   Top5 Recall   Precision   Top5 Precision   F1      Top5 F1
Mismatch (5,1)    51.17   22.72        53.22        28.86             46.78    71.14         90.52       95.25            61.68   81.45
Mismatch (5,2)    42.30   19.32        44.89        22.66             55.11    77.34         67.36       84.77            60.62   80.89
Mismatch (5,2)†   27.42   14.36        24.98        13.36             75.02    86.64         79.01       91.02            76.96   88.78
Mismatch (7,3)    43.60   19.06        47.13        22.76             52.87    77.24         84.65       91.95            65.09   83.96
Mismatch (7,3)†   26.11   12.53        25.01        12.57             74.99    87.43         85.00       92.78            79.68   90.02
Mismatch (7,3)‡   23.76   11.75        22.49        12.14             77.59    87.86         84.90       91.99            81.04   89.88
† used the Swiss-Prot sequence database; ‡ used the NR (non-redundant) database

7 Conclusions

We presented new algorithms for inexact matching of discrete-valued string representations that reduce the computational complexity of current algorithms, demonstrate state-of-the-art performance and significantly improved running times. This improvement makes string kernels with approximate but looser matching a viable alternative for practical tasks of sequence analysis. Our algorithms work with large databases in supervised and semi-supervised settings and scale well in the alphabet size and the number of allowed mismatches. As a consequence, the proposed algorithms can be readily applied to other challenging problems in sequence analysis and mining.

References

[1] Jianlin Cheng and Pierre Baldi. A machine learning information retrieval approach to protein fold recognition. Bioinformatics, 22(12):1456-1463, June 2006.

[2] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina S. Leslie. Profile-based string kernels for remote homology detection and motif extraction. 
In CSB, pages 152-160, 2004.

[3] Christina S. Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. Mismatch string kernels for SVM protein classification. In NIPS, pages 1417-1424, 2002.

[4] Sören Sonnenburg, Gunnar Rätsch, and Bernhard Schölkopf. Large scale genomic sequence SVM classifiers. In ICML '05, pages 848-855, New York, NY, USA, 2005.

[5] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[6] Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435-1455, 2004.

[7] Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, André Elisseeff, and William Stafford Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241-3247, 2005.

[8] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.

[9] Christina S. Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 566-575, 2002.

[10] S. V. N. Vishwanathan and Alex Smola. Fast kernels for string and tree matching. Advances in Neural Information Processing Systems, 15, 2002.

[11] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences, 84:4355-4358, 1987.

[12] Juho Rousu and John Shawe-Taylor. Efficient computation of gapped substring kernels on large alphabets. J. Mach. Learn. Res., 6:1323-1344, 2005.

[13] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast protein homology and fold detection with sparse spatial sample kernels. In ICPR 2008, 2008.

[14] Chris H.Q. Ding and Inna Dubchak. 
Multi-class protein fold recognition using support vector machines\n\nand neural networks. Bioinformatics, 17(4):349\u2013358, 2001.\n\n[15] Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, and Christina Leslie. Multi-class protein\n\nclassi\ufb01cation using adaptive codes. J. Mach. Learn. Res., 8:1557\u20131581, 2007.\n\n[16] Tao Li, Mitsunori Ogihara, and Qi Li. A comparative study on content-based music genre classi\ufb01cation.\n\nIn SIGIR \u201903, pages 282\u2013289, New York, NY, USA, 2003. ACM.\n\n[17] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. On the role of local matching for ef\ufb01cient semi-\n\nsupervised protein sequence classi\ufb01cation. In BIBM, 2008.\n\n[18] Tommi Jaakkola, Mark Diekhans, and David Haussler. A discriminative framework for detecting remote\n\nprotein homologies. In Journal of Computational Biology, volume 7, pages 95\u2013114, 2000.\n\n\f", "award": [], "sourceid": 373, "authors": [{"given_name": "Pavel", "family_name": "Kuksa", "institution": null}, {"given_name": "Pai-hsi", "family_name": "Huang", "institution": null}, {"given_name": "Vladimir", "family_name": "Pavlovic", "institution": null}]}