{"title": "Mismatch String Kernels for SVM Protein Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1441, "page_last": 1448, "abstract": null, "full_text": "Mismatch String Kernels for SVM Protein Classification

Christina Leslie
Department of Computer Science
Columbia University
cleslie@cs.columbia.edu

Eleazar Eskin
Department of Computer Science
Columbia University
eeskin@cs.columbia.edu

Jason Weston
Max-Planck Institute
Tuebingen, Germany
weston@tuebingen.mpg.de

William Stafford Noble*
Department of Genome Sciences
University of Washington
noble@gs.washington.edu

Abstract

We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.

1 Introduction

A fundamental problem in computational biology is the classification of proteins into functional and structural classes based on homology (evolutionary similarity) of protein sequence data. Known methods for protein classification and homology detection include pairwise sequence alignment [1, 2, 3], profiles for protein families [4], consensus patterns using motifs [5, 6] and profile hidden Markov models [7, 8, 9].
We are most interested in discriminative methods, where protein sequences are seen as a set of labeled examples, positive if they are in the protein family or superfamily and negative otherwise, and we train a classifier to distinguish between the two classes. We focus on the more difficult problem of remote homology detection, where we want our classifier to detect (as positives) test sequences that are only remotely related to the positive training sequences.

One of the most successful discriminative techniques for protein classification, and the best performing method for remote homology detection, is the Fisher-SVM [10, 11] approach of Jaakkola et al. In this method, one first builds a profile hidden Markov model (HMM) for the positive training sequences, defining a log likelihood function log P(x | θ) for any protein sequence x. If θ0 is the maximum likelihood estimate for the model parameters, then the gradient vector ∇_θ log P(x | θ) |_{θ = θ0} assigns to each (positive or negative) training sequence x an explicit vector of features called Fisher scores. This feature mapping defines a kernel function, called the Fisher kernel, that can then be used to train a support vector machine (SVM) [12, 13] classifier. One of the strengths of the Fisher-SVM approach is that it combines the rich biological information encoded in a hidden Markov model with the discriminative power of the SVM algorithm.

* Formerly William Noble Grundy: see http://www.cs.columbia.edu/~noble/name-change.html
However, one generally needs a lot of data or sophisticated priors to train the hidden Markov model, and because calculating the Fisher scores requires computing forward and backward probabilities from the Baum-Welch algorithm (quadratic in sequence length for profile HMMs), in practice it is very expensive to compute the kernel matrix.

In this paper, we present a new string kernel, called the mismatch kernel, for use with an SVM for remote homology detection. The (k,m)-mismatch kernel is based on a feature map to a vector space indexed by all possible subsequences of amino acids of a fixed length k; each instance of a fixed k-length subsequence in an input sequence contributes to all feature coordinates differing from it by at most m mismatches. Thus, the mismatch kernel adds the biologically important idea of mismatching to the computationally simpler spectrum kernel presented in [14]. In the current work, we also describe how to compute the new kernel efficiently using a mismatch tree data structure; for values of (k,m) useful in this application, the kernel is fast enough to use on real datasets and is considerably less expensive than the Fisher kernel. We report results from a benchmark dataset on the SCOP database [15] assembled by Jaakkola et al. [10] and show that the mismatch kernel used with an SVM classifier achieves performance equal to the Fisher-SVM method while outperforming all other methods tested. Finally, we note that the mismatch kernel does not depend on any generative model and could potentially be used in other sequence-based classification problems.

2 Spectrum and Mismatch String Kernels

The basis for our approach to protein classification is to represent protein sequences as vectors in a high-dimensional feature space via a string-based feature map.
We then train a support vector machine (SVM), a large-margin linear classifier, on the feature vectors representing our training sequences. Since SVMs are a kernel-based learning algorithm, we do not calculate the feature vectors explicitly but instead compute their pairwise inner products using a mismatch string kernel, which we define in this section.

2.1 Feature Maps for Strings

The (k,m)-mismatch kernel is based on a feature map from the space of all finite sequences from an alphabet Σ of size |Σ| = ℓ to the ℓ^k-dimensional vector space indexed by the set of k-length subsequences (“k-mers”) from Σ. (For protein sequences, Σ is the alphabet of amino acids, ℓ = 20.) For a fixed k-mer α = a1 a2 … ak, with each ai a character in Σ, the (k,m)-neighborhood generated by α is the set of all k-length sequences β from Σ that differ from α by at most m mismatches. We denote this set by N_(k,m)(α).

We define our feature map Φ_(k,m) as follows: if α is a k-mer, then

    Φ_(k,m)(α) = (φ_β(α))_{β ∈ Σ^k},    (1)

where φ_β(α) = 1 if β belongs to N_(k,m)(α), and φ_β(α) = 0 otherwise. Thus, a k-mer contributes weight to all the coordinates in its mismatch neighborhood. For a sequence x of any length, we extend the map additively by summing the feature vectors for all the k-mers in x:

    Φ_(k,m)(x) = Σ_{k-mers α in x} Φ_(k,m)(α).

Note that the β-coordinate of Φ_(k,m)(x) is just a count of all instances of the k-mer β occurring with up to m mismatches in x. The (k,m)-mismatch kernel K_(k,m) is the inner product in feature space of feature vectors:

    K_(k,m)(x, y) = ⟨Φ_(k,m)(x), Φ_(k,m)(y)⟩.

For m = 0, we retrieve the k-spectrum kernel defined in [14].

2.2 Fisher Scores and the Spectrum Kernel

While we define the spectrum and mismatch feature maps without any reference to a generative model for the positive class of sequences, there is some similarity between the k-spectrum feature map and the Fisher scores associated to an order k−1 Markov chain model. More precisely, suppose the generative model for the positive training sequences is given by an order k−1 Markov chain:

    P(x | θ) = P(x1 … x(k−1)) ∏_{j=k}^{N} P(xj | x(j−k+1) … x(j−1))

for a string x = x1 x2 … xN, with parameters θ_{s|t} = P(xj = s | x(j−k+1) … x(j−1) = t) for a character s and a (k−1)-mer t = t1 … t(k−1) over Σ. Denote by θ0 the maximum likelihood estimate for θ on the positive training set. To calculate the Fisher scores for this model, we follow [10] and define independent variables θ̃_{s|t} satisfying θ_{s|t} = θ̃_{s|t} / Σ_{s'} θ̃_{s'|t}. Then the Fisher scores are given by

    ∂/∂θ̃_{s|t} log P(x | θ) |_{θ̃ = θ0} = c_{ts}(x) / θ0_{s|t} − c_t(x),

where c_{ts}(x) is the number of instances of the k-mer ts in x, and c_t(x) is the number of instances of the (k−1)-mer t. Thus the Fisher score captures the degree to which the k-mer ts is over- or under-represented relative to the positive model. For the k-spectrum kernel, the corresponding feature coordinate looks similar but simply uses the unweighted count: φ_{ts}(x) = c_{ts}(x).

3 Efficient Computation of the Mismatch Kernel

Unlike the Fisher vectors used in [10], our feature vectors are sparse vectors in a very high dimensional feature space.
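To make the definitions concrete, the feature map and kernel can be computed naively by brute-force enumeration of the ℓ^k coordinates. The sketch below is illustrative only (the function names and the small nucleotide alphabet are our choices; the paper's experiments use the 20-letter amino acid alphabet and the efficient algorithm of Section 3):

```python
from itertools import product

def mismatch_feature_map(x, k, m, alphabet="ACGT"):
    """Sparse (k,m)-mismatch feature vector of x: each k-mer instance in x adds 1
    to every k-mer within Hamming distance m of it (its (k,m)-neighborhood)."""
    features = {}
    for i in range(len(x) - k + 1):
        alpha = x[i:i + k]
        for beta_chars in product(alphabet, repeat=k):  # brute force over all l^k k-mers
            beta = "".join(beta_chars)
            if sum(a != b for a, b in zip(alpha, beta)) <= m:
                features[beta] = features.get(beta, 0) + 1
    return features

def mismatch_kernel(x, y, k, m, alphabet="ACGT"):
    """K_(k,m)(x, y): inner product of the two sparse feature vectors."""
    fx = mismatch_feature_map(x, k, m, alphabet)
    fy = mismatch_feature_map(y, k, m, alphabet)
    return sum(v * fy.get(kmer, 0) for kmer, v in fx.items())
```

Setting m = 0 recovers the k-spectrum kernel. This brute force costs O(ℓ^k) per k-mer instance, which is exactly the work the mismatch tree of Section 3 avoids.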
Thus, instead of calculating and storing the feature vectors, we directly and efficiently compute the kernel matrix for use with an SVM classifier.

3.1 Mismatch Tree Data Structure

We use a mismatch tree data structure (similar to a trie or suffix tree [16, 17]) to represent the feature space (the set of all k-mers) and perform a lexical traversal of all k-mers occurring in the sample dataset, matched with up to m mismatches; the entire kernel matrix K(xi, xj), for the sample of sequences x1, …, xM, is computed in one traversal of the tree.

A (k,m)-mismatch tree is a rooted tree of depth k where each internal node has ℓ branches and each branch is labeled with a symbol from Σ.
A leaf node represents a fixed k-mer in our feature space, obtained by concatenating the branch symbols along the path from root to leaf, and an internal node represents the prefix for those k-mer features which are its descendants in the tree. We use a depth-first search of this tree to store, at each node that we visit, a set of pointers to all instances of the current prefix pattern that occur with mismatches in the sample data. Thus at each node of depth d, we maintain pointers to all substrings from the sample data set whose d-length prefixes are within m mismatches of the d-length prefix represented by the path down from the root. Note that the set of valid substrings at a node is a subset of the set of valid substrings of its parent. When we encounter a node with an empty list of pointers (no valid occurrences of the current prefix), we do not need to search below it in the tree. When we reach a leaf node, we sum the contributions of all instances occurring in each source sequence to obtain feature values corresponding to the current k-mer, and we update the kernel matrix entry K(xi, xj) for each pair of source sequences xi and xj having non-zero feature values.

Figure 1: An (8,1)-mismatch tree for a sequence AVLALKAVLL, showing valid instances at each node down a path: (a) at the root node; (b) after expanding the path one level; and (c) after expanding the path two levels.
The number of mismatches for each instance is also indicated.

3.2 Efficiency of the Kernel Computation

Since we compute the kernel in one depth-first traversal, we do not actually need to store the entire mismatch tree but instead compute the kernel using a recursive function, which makes more efficient use of memory and allows kernel computations for large datasets.

The number of k-mers within m mismatches of any given fixed k-mer is Σ_{i=0}^{m} C(k, i)(ℓ − 1)^i = O(k^m ℓ^m). Thus the effective number of k-mer instances that we need to traverse grows as O(N k^m ℓ^m), where N is the total length of the sample data. At a leaf node, if exactly q input sequences contain valid instances of the current k-mer, one performs O(q^2) updates to the kernel matrix. For M sequences each of length n (total length N = Mn), the worst case for the kernel computation occurs when the feature vectors are all equal and have the maximal number of non-zero entries, giving worst case overall running time O(k^m ℓ^m n M^2). For the application we discuss here, small values of m are most useful, and the kernel calculations are quite inexpensive.
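The depth-first recursion just described can be sketched directly; the code below is our own illustrative rendering of that traversal, not the authors' implementation. An instance is a (sequence, offset, mismatch count) triple; each recursive call extends the current prefix by one symbol and discards instances that exceed m mismatches, empty branches are pruned, and at depth k the per-sequence instance counts are multiplied into the kernel matrix.

```python
def mismatch_kernel_matrix(seqs, k, m, alphabet="ACGT"):
    """Full (k,m)-mismatch kernel matrix for a list of sequences, computed in a
    single depth-first traversal of the (implicit) mismatch tree."""
    n = len(seqs)
    K = [[0.0] * n for _ in range(n)]
    # An instance is (sequence index, offset of the k-mer, mismatches so far).
    root = [(i, j, 0) for i, s in enumerate(seqs) for j in range(len(s) - k + 1)]

    def expand(instances, depth):
        if depth == k:  # leaf: the current prefix is a full k-mer feature
            counts = {}
            for i, _, _ in instances:
                counts[i] = counts.get(i, 0) + 1
            for i, ci in counts.items():  # feature value of sequence i at this k-mer
                for j, cj in counts.items():
                    K[i][j] += ci * cj
            return
        for a in alphabet:  # branch on the next prefix symbol
            survivors = [
                (i, j, mm + (seqs[i][j + depth] != a))
                for i, j, mm in instances
                if mm + (seqs[i][j + depth] != a) <= m
            ]
            if survivors:  # prune branches with no valid occurrences
                expand(survivors, depth + 1)

    expand(root, 0)
    return K
```

Only prefixes that actually occur (up to m mismatches) in the data are ever expanded, which is what keeps the traversal cheap for small m.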
When mismatch kernels are used in combination with SVMs, the learned classifier

    f(x) = Σ_i λ_i y_i K_(k,m)(x_i, x)

(where the x_i are the training sequences that map to support vectors, the y_i are labels, and the λ_i are weights) can be implemented by pre-computing and storing per k-mer scores. Then the prediction f(x) can be calculated in time linear in the sequence length by look-up of k-mer scores. In practice, one usually wants to use a normalized feature map, so one would also need to compute the norm of the vector Φ_(k,m)(x). Simple normalization schemes, like dividing by sequence length, can also be used.

4 Experiments: Remote Protein Homology Detection

We test the mismatch kernel with an SVM classifier on the SCOP [15] (version 1.37) datasets designed by Jaakkola et al. [10] for the remote homology detection problem. In these experiments, remote homology is simulated by holding out all members of a target SCOP family from a given superfamily. Positive training examples are chosen from the remaining families in the same superfamily, and negative test and training examples are chosen from disjoint sets of folds outside the target family's fold. The held-out family members serve as positive test examples. In order to train HMMs, Jaakkola et al.
used the SAM-T98 algorithm to pull in domain homologs from the non-redundant protein database and added these sequences as positive examples in the experiments. Details of the datasets are available at www.soe.ucsc.edu/research/compbio/discriminative.

Because the test sets are designed for remote homology detection, we use small values of k. We tested (k,m) = (5,1) and (6,1), where we normalized the kernel via

    K_Norm(x, y) = K_(k,m)(x, y) / sqrt( K_(k,m)(x, x) K_(k,m)(y, y) ).

We found that (5,1) gave slightly better performance, though results were similar for the two choices. (Data for (6,1) not shown.) We use a publicly available SVM implementation (www.cs.columbia.edu/compbio/svm) of the soft margin optimization algorithm described in [10]. For comparison, we include results from three other methods. These include the original experimental results from Jaakkola et al. for two methods: the SAM-T98 iterative HMM, and the Fisher-SVM method. We also test PSI-BLAST [3], an alignment-based method widely used in the biological community, on the same data using the methodology described in [14].

Figure 2 illustrates the mismatch-SVM method's performance relative to three existing homology detection methods as measured by ROC scores. The figure includes results for all 33 SCOP families, and each series corresponds to one homology detection method. Qualitatively, the curves for Fisher-SVM and mismatch-SVM are quite similar.
When we compare the overall performance of two methods using a two-tailed signed rank test [18, 19] based on ROC scores over the 33 families, with a p-value threshold of 0.05 and including a Bonferroni adjustment to account for multiple comparisons, we find only the following significant differences: Fisher-SVM and mismatch-SVM perform better than SAM-T98 (with p-values 1.3e-02 and 2.7e-02, respectively); and these three methods all perform significantly better than PSI-BLAST in this experiment.

Figure 3 shows a family-by-family comparison of performance of the (5,1)-mismatch-SVM and Fisher-SVM using ROC scores in plot (A) and ROC-50 scores in plot (B).¹ In both plots, the points fall approximately evenly above and below the diagonal, indicating little difference in performance between the two methods. Figure 4 shows the improvement provided by including mismatches in the SVM kernel.

¹The ROC-50 score is the area under the graph of the number of true positives as a function of false positives, up to the first 50 false positives, scaled so that both axes range from 0 to 1. This score is sometimes preferred in the computational biology community, motivated by the idea that a biologist might be willing to sift through about 50 false positives.

Figure 2: Comparison of four homology detection methods.
The graph plots the total number of families for which a given method exceeds an ROC score threshold.

The figures plot ROC scores (plot (A)) and ROC-50 scores (plot (B)) for two string kernel SVM methods: using the k = 5, m = 1 mismatch kernel, and using the k = 3 (no mismatch) spectrum kernel, the best-performing choice with m = 0. Almost all of the families perform better with mismatching than without, showing that mismatching gives significantly better generalization performance.

5 Discussion

We have presented a class of string kernels that measure sequence similarity without requiring alignment or depending upon a generative model, and we have given an efficient method for computing these kernels. For the remote homology detection problem, our discriminative approach, combining support vector machines with the mismatch kernel, performs as well in the SCOP experiments as the most successful known method.

A practical protein classification system would involve fast multi-class prediction, potentially involving thousands of binary classifiers, on massive test sets. In such applications, computational efficiency of the kernel function becomes an important issue. Chris Watkins [20] and David Haussler [21] have recently defined a set of kernel functions over strings, and one of these string kernels has been implemented for a text classification problem [22]. However, the cost of computing each kernel entry is quadratic in the length of the input sequences. Similarly, the Fisher kernel of Jaakkola et al. requires quadratic-time computation for each Fisher vector calculated.
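By contrast, the per-instance work for the mismatch kernel is governed by the neighborhood size Σ_{i=0}^{m} C(k, i)(ℓ − 1)^i from Section 3.2, which stays small for the small m used in practice. A quick illustrative calculation (the function name is our own):

```python
from math import comb

def neighborhood_size(k, m, l=20):
    """Number of k-mers within m mismatches of a fixed k-mer over an alphabet
    of size l: choose i positions to mutate, with (l - 1) choices for each."""
    return sum(comb(k, i) * (l - 1) ** i for i in range(m + 1))

# For the (5,1) kernel used in the experiments, each 5-mer instance touches only
# 1 + 5*19 = 96 coordinates, out of 20**5 = 3.2 million possible 5-mers.
print(neighborhood_size(5, 1))  # 96
print(neighborhood_size(5, 2))  # 3706
```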
The (k,m)-mismatch kernel is relatively inexpensive to compute for values of (k,m) that are practical in applications, allows computation of multiple kernel values in one pass, and significantly improves performance over the previously presented (mismatch-free) spectrum kernel.

Many family-based remote homology detection algorithms incorporate a method for selecting probable domain homologs from unannotated protein sequence databases for additional training data. In these experiments, we used the domain homologs that were identified by SAM-T98 (an iterative HMM-based algorithm) as part of the Fisher-SVM method and included in the datasets; these homologs may be more useful to the Fisher kernel than to the mismatch kernel. We plan to extend our method by investigating semi-supervised techniques for selecting unannotated sequences for use with the mismatch-SVM.

Figure 3: Family-by-family comparison of (5,1)-mismatch-SVM with Fisher-SVM. The coordinates of each point in the plot are the ROC scores (plot (A)) or ROC-50 scores (plot (B)) for one SCOP family, obtained using the mismatch-SVM with k = 5, m = 1 (x-axis) and Fisher-SVM (y-axis).
The dotted line is y = x.

Figure 4: Family-by-family comparison of (5,1)-mismatch-SVM with spectrum-SVM. The coordinates of each point in the plot are the ROC scores (plot (A)) or ROC-50 scores (plot (B)) for one SCOP family, obtained using the mismatch-SVM with k = 5, m = 1 (x-axis) and spectrum-SVM with k = 3 (y-axis). The dotted line is y = x.

Many interesting variations on the mismatch kernel can be explored using the framework presented here. For example, explicit k-mer feature selection can be implemented during calculation of the kernel matrix, based on a criterion enforced at each leaf or internal node. Potentially, a good feature selection criterion could improve performance in certain applications while decreasing kernel computation time. In biological applications, it is also natural to consider weighting each k-mer instance contribution to a feature coordinate by evolutionary substitution probabilities. Finally, one could use linear combinations of kernels K_(k,m), for varying choices of (k,m), to capture similarity of different length k-mers.
We believe that further experimentation with mismatch string kernels could be fruitful for remote protein homology detection and other biological sequence classification problems.

Acknowledgments

CL is partially supported by NIH grant LM07276-02. WSN is supported by NSF grants DBI-0078523 and ISI-0093302. We thank Nir Friedman for pointing out the connection with Fisher scores for Markov chain models.

References

[1] M. S. Waterman, J. Joyce, and M. Eggert. Computer alignment of sequences, chapter Phylogenetic Analysis of DNA Sequences. Oxford, 1991.

[2] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

[3] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.

[4] Michael Gribskov, Andrew D. McLachlan, and David Eisenberg. Profile analysis: Detection of distantly related proteins. PNAS, pages 4355–4358, 1987.

[5] A. Bairoch. The PROSITE database, its status in 1995. Nucleic Acids Research, 24:189–196, 1995.

[6] T. K. Attwood, M. E. Beck, D. R. Flower, P. Scordis, and J. N. Selley. The PRINTS protein fingerprint database in its fifth year. Nucleic Acids Research, 26(1):304–308, 1998.

[7] A. Krogh, M. Brown, I. Mian, K. Sjolander, and D. Haussler. Hidden Markov models in computational biology: Applications to protein modeling. Journal of Molecular Biology, 235:1501–1531, 1994.

[8] S. R. Eddy. Multiple alignment using hidden Markov models. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, pages 114–120. AAAI Press, 1995.

[9] P. Baldi, Y.
Chauvin, T. Hunkapiller, and M. A. McClure. Hidden Markov models of biological primary sequence information. PNAS, 91(3):1059–1063, 1994.

[10] T. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 2000.

[11] T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 149–158. AAAI Press, 1999.

[12] V. N. Vapnik. Statistical Learning Theory. Springer, 1998.

[13] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge, 2000.

[14] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Symposium on Biocomputing, 2002.

[15] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.

[16] M. Sagot. Spelling approximate or repeated motifs using a suffix tree. Lecture Notes in Computer Science, 1380:111–127, 1998.

[17] G. Pavesi, G. Mauri, and G. Pesole. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17:S207–S214, July 2001. Proceedings of the Ninth International Conference on Intelligent Systems for Molecular Biology.

[18] S. Henikoff and J. G. Henikoff. Embedding strategies for effective use of information from multiple sequence alignments. Protein Science, 6(3):698–705, 1997.

[19] S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–328, 1997.

[20] C. Watkins. Dynamic alignment kernels.
Technical report, UL Royal Holloway, 1999.

[21] D. Haussler. Convolution kernels on discrete structures. Technical report, UC Santa Cruz, 1999.

[22] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. Text classification using string kernels. Preprint.
", "award": [], "sourceid": 2179, "authors": [{"given_name": "Eleazar", "family_name": "Eskin", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "William", "family_name": "Noble", "institution": null}, {"given_name": "Christina", "family_name": "Leslie", "institution": null}]}