{"title": "Text Classification using String Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 563, "page_last": 569, "abstract": null, "full_text": "Text Classification using String Kernels \n\nHUlna Lodhi \n\nJohn Shawe-Taylor \n\nN ello Cristianini \n\nChris Watkins \n\nDepartment of Computer Science Royal Holloway, University of London \n\nEgham, Surrey TW20 OEX, UK \n\n{huma, john, nello, chrisw}Cdcs.rhbnc.ac.uk \n\nAbstract \n\nWe introduce a novel kernel for comparing two text documents. \nThe kernel is an inner product in the feature space consisting of \nall subsequences of length k. A subsequence is any ordered se(cid:173)\nquence of k characters occurring in the text though not necessarily \ncontiguously. The subsequences are weighted by an exponentially \ndecaying factor of their full length in the text, hence emphasising \nthose occurrences which are close to contiguous. A direct compu(cid:173)\ntation of this feature vector would involve a prohibitive amount of \ncomputation even for modest values of k, since the dimension of \nthe feature space grows exponentially with k. The paper describes \nhow despite this fact the inner product can be efficiently evaluated \nby a dynamic programming technique. A preliminary experimental \ncomparison of the performance of the kernel compared with a stan(cid:173)\ndard word feature space kernel \nresults. \n\n[6] is made showing encouraging \n\n1 \n\nIntroduction \n\nStandard learning systems (like neural networks or decision trees) operate on in(cid:173)\nput data after they have been transformed into feature vectors XI, \u2022\u2022\u2022 , Xl E X from \nan n dimensional space. There are cases, however, where the input data can not \nbe readily described by explicit feature vectors: for example biosequences, images, \ngraphs and text documents. 
For such datasets, the construction of a feature extraction module can be as complex and expensive as solving the entire problem. An effective alternative to explicit feature extraction is provided by kernel methods. \n\nKernel-based learning methods use an implicit mapping of the input data into a high dimensional feature space defined by a kernel function, i.e. a function returning the inner product between the images of two data points in the feature space. The learning then takes place in the feature space, provided the learning algorithm can be entirely rewritten so that the data points only appear inside dot products with other data points. \n\nSeveral linear algorithms can be formulated in this way, for clustering, classification and regression. The most typical example of kernel-based systems is the Support Vector Machine (SVM) [10, 3], which implements linear classification. \n\nOne interesting property of kernel-based systems is that, once a valid kernel function has been selected, one can practically work in spaces of any dimensionality without paying any computational cost, since the feature mapping is never effectively performed. In fact, one does not even need to know what features are being used. In this paper we examine the use of a kernel method based on string alignment for text categorisation problems. \n\nA standard approach [5] to text categorisation makes use of the so-called bag of words (BOW) representation, mapping a document to a bag (i.e. a set that counts repeated elements), hence losing all the word order information and only retaining the frequency of the terms in the document. This is usually accompanied by the removal of non-informative words (stop words) and by the replacing of words by their stems, so losing inflection information. This simple technique has recently been used very successfully in supervised learning tasks with Support Vector Machines (SVM) [5]. 
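As an illustration, the BOW mapping just described can be sketched in a few lines of Python. This is our own sketch, not the system of [5]: the stop list is a toy one, and stemming is omitted.

```python
from collections import Counter

# Toy stop list for illustration only; real systems use much larger
# lists and also replace words by their stems.
STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'to'}

def bag_of_words(document):
    """Map a document to a bag (multiset) of term frequencies,
    discarding word order and dropping stop words."""
    words = [w for w in document.lower().split() if w not in STOP_WORDS]
    return Counter(words)

print(bag_of_words('the cat sat and the cat ran'))
```

The `Counter` captures exactly the "bag" semantics: repeated elements are counted, but all ordering information is lost.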
\n\nIn this paper we propose a radically different approach, which considers documents simply as symbol sequences, and makes use of specific kernels. The approach is entirely subsymbolic, in the sense that it considers the document just as a single long sequence, and yet it is capable of capturing topic information. We build on recent advances [11, 4] that demonstrated how to build kernels over general structures like sequences. The most remarkable property of such methods is that they map documents to vectors without explicitly representing them, by means of sequence alignment techniques. A dynamic programming technique makes the computation of the kernels very efficient (linear in the documents' length). \n\nIt is surprising that such a radical strategy, extracting only alignment information, delivers positive results in topic classification, comparable with the performance of problem-specific strategies: it seems that in some sense the semantics of the document can be at least partly captured by the presence of certain substrings of symbols. \n\nSupport Vector Machines [3] are linear classifiers in a kernel defined feature space. The kernel is a function which returns the dot product of the feature vectors φ(x) and φ(x') of two inputs x and x': K(x, x') = φ(x)^T φ(x'). Choosing very high dimensional feature spaces ensures that the required functionality can be obtained using linear classifiers. The computational difficulties of working in such feature spaces are avoided by using a dual representation of the linear functions in terms of the training set S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, \n\nf(x) = Σ_{i=1}^m α_i y_i K(x, x_i) - b. \n\nThe danger of overfitting by resorting to such a high dimensional space is averted by maximising the margin, or a related soft version of this criterion, a strategy that has been shown to ensure good generalisation despite the high dimensionality [9, 8]. 
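The dual decision function can be sketched directly: only kernel evaluations against the training points are needed, never the explicit feature vectors. This is a minimal helper of our own (the trained coefficients α_i, labels y_i and threshold b are assumed given; the linear kernel on scalars is a placeholder).

```python
def decision_function(x, support, alpha, y, b, K):
    """Dual-form SVM output f(x) = sum_i alpha_i * y_i * K(x, x_i) - b,
    summing over the training points x_i via kernel evaluations only."""
    return sum(a * yi * K(x, xi) for a, yi, xi in zip(alpha, y, support)) - b

# Toy usage: two support vectors at +1 and -1 with a linear kernel.
K_lin = lambda u, v: u * v
print(decision_function(2.0, [1.0, -1.0], [0.5, 0.5], [1, -1], 0.0, K_lin))  # 2.0
```

Because `K` is a parameter, swapping in a different kernel changes the feature space without touching the classifier code, which is the property the text exploits.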
\n\n2 A Kernel for Text Sequences \n\nIn this section we describe a kernel between two text documents. The idea is to compare them by means of the substrings they contain: the more substrings in common, the more similar they are. An important point is that such substrings do not need to be contiguous, and the degree of contiguity of one such substring in a document determines how much weight it will have in the comparison. \n\nFor example: the substring 'c-a-r' is present both in the word 'card' and in the word 'custard', but with different weighting. For each such substring there is a dimension of the feature space, and the value of such a coordinate depends on how frequently and how compactly the string is embedded in the text. In order to deal with non-contiguous substrings, it is necessary to introduce a decay factor λ ∈ (0, 1) that can be used to weight the presence of a certain feature in a text (see Definition 1 for more details). \n\nExample. Consider the words cat, car, bat, bar. If we consider only k = 2, we obtain an 8-dimensional feature space, where the words are mapped as follows: \n\n         c-a   c-t   a-t   b-a   b-t   c-r   a-r   b-r \nφ(cat)   λ^2   λ^3   λ^2   0     0     0     0     0 \nφ(car)   λ^2   0     0     0     0     λ^3   λ^2   0 \nφ(bat)   0     0     λ^2   λ^2   λ^3   0     0     0 \nφ(bar)   0     0     0     λ^2   0     0     λ^2   λ^3 \n\nHence, the unnormalised kernel between car and cat is K(car, cat) = λ^4, whereas the normalised version is obtained as follows: K(car, car) = K(cat, cat) = 2λ^4 + λ^6, and hence K̂(car, cat) = λ^4/(2λ^4 + λ^6) = 1/(2 + λ^2). Note that in general a document will contain more than one word, but the mapping for the whole document is into one feature space. Punctuation is ignored, but spaces are retained. 
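The example above can be reproduced by computing the feature map by brute force: enumerate every k-character subsequence and weight each occurrence by λ raised to its span. This is a sketch with function names of our own, feasible only for tiny k and very short strings (the point the next paragraph makes).

```python
from itertools import combinations
from collections import defaultdict

def phi(s, k, lam):
    """Explicit feature map: weight each occurrence of a k-subsequence u
    of s by lam**l(i), where l(i) = i_k - i_1 + 1 is the occurrence's span."""
    features = defaultdict(float)
    for idx in combinations(range(len(s)), k):
        u = ''.join(s[i] for i in idx)
        features[u] += lam ** (idx[-1] - idx[0] + 1)
    return features

def kernel(s, t, k, lam):
    """Inner product of the explicit feature vectors."""
    fs, ft = phi(s, k, lam), phi(t, k, lam)
    return sum(v * ft[u] for u, v in fs.items())

lam = 0.5
print(kernel('car', 'cat', 2, lam))  # lam**4 = 0.0625
print(kernel('car', 'car', 2, lam))  # 2*lam**4 + lam**6 = 0.140625
```

With λ = 0.5 this reproduces the table's values: the only shared feature of car and cat is c-a, contributing λ^2 · λ^2 = λ^4.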
\nHowever, for interesting substring sizes (e.g. k > 4) direct computation of all the relevant features would be impractical even for moderately sized texts, and hence explicit use of such a representation would be impossible. But it turns out that a kernel using such features can be defined and calculated in a very efficient way by using dynamic programming techniques. \n\nWe derive the kernel by starting from the features and working out their inner product. In this case there is no need to prove that it satisfies Mercer's conditions (symmetry and positive semi-definiteness), since they will follow automatically from its definition as an inner product. This kernel is based on work [11, 4] mostly motivated by bioinformatics applications. It maps strings to a feature vector indexed by all k-tuples of characters. A k-tuple will have a non-zero entry if it occurs as a subsequence anywhere (not necessarily contiguously) in the string. The weighting of the feature will be the sum over the occurrences of the k-tuple of a decaying factor of the length of the occurrence. \n\nDefinition 1 (String subsequence kernel) Let Σ be a finite alphabet. A string is a finite sequence of characters from Σ, including the empty sequence. For strings s, t, we denote by |s| the length of the string s = s_1...s_{|s|}, and by st the string obtained by concatenating the strings s and t. The string s[i : j] is the substring s_i...s_j of s. We say that u is a subsequence of s if there exist indices i = (i_1, ..., i_{|u|}), with 1 ≤ i_1 < ... < i_{|u|} ≤ |s|, such that u_j = s_{i_j} for j = 1, ..., |u|, or u = s[i] for short. The length l(i) of the subsequence in s is i_{|u|} - i_1 + 1. We denote by Σ^n the set of all finite strings of length n, and by Σ* the set of all strings, \n\nΣ* = ∪_{n=0}^∞ Σ^n. (1) \n\nWe now define feature spaces F_n = ℝ^{Σ^n}, given by defining the u coordinate φ_u(s) for each u ∈ Σ^n. 
We define the feature mapping φ for a string s by \n\nφ_u(s) = Σ_{i: u = s[i]} λ^{l(i)}, (2) \n\nfor some λ < 1. These features measure the number of occurrences of subsequences in the string s, weighting them according to their lengths. Hence, the inner product of the feature vectors for two strings s and t gives a sum over all common subsequences weighted according to their frequency of occurrence and lengths: \n\nK_n(s, t) = Σ_{u ∈ Σ^n} φ_u(s) φ_u(t) = Σ_{u ∈ Σ^n} Σ_{i: u = s[i]} λ^{l(i)} Σ_{j: u = t[j]} λ^{l(j)} = Σ_{u ∈ Σ^n} Σ_{i: u = s[i]} Σ_{j: u = t[j]} λ^{l(i) + l(j)}. \n\nIn order to derive an effective procedure for computing such a kernel, we introduce an additional function which will aid in defining a recursive computation for this kernel. Let \n\nK'_i(s, t) = Σ_{u ∈ Σ^i} Σ_{i: u = s[i]} Σ_{j: u = t[j]} λ^{|s| + |t| - i_1 - j_1 + 2}, i = 1, ..., n - 1, \n\nthat is, counting the length to the end of the strings s and t instead of just l(i) and l(j). We can now define a recursive computation for K'_i and hence compute K_n. \n\nDefinition 2 Recursive computation of the subsequence kernel. \n\nK'_0(s, t) = 1, for all s, t, \nK'_i(s, t) = 0, if min(|s|, |t|) < i, \nK_n(s, t) = 0, if min(|s|, |t|) < n, \nK'_i(sx, t) = λ K'_i(s, t) + Σ_{j: t_j = x} K'_{i-1}(s, t[1 : j - 1]) λ^{|t| - j + 2}, i = 1, ..., n - 1, \nK_n(sx, t) = K_n(s, t) + Σ_{j: t_j = x} K'_{n-1}(s, t[1 : j - 1]) λ^2. \n\nThe correctness of this recursion follows from observing how the length of the strings has increased, incurring a factor of λ for each extra character, until the full length of n characters has been attained. If we wished to compute K_n(s, t) for a range of values of n, we would simply perform the computation of K'_i(s, t) up to one less than the largest n required, and then apply the last recursion for each K_n(s, t) that is needed using the stored values of K'_i(s, t). We can of course create a kernel K(s, t) that combines the different K_n(s, t) giving different (positive) weightings for each n. 
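Definition 2 can be transcribed almost literally into a memoised recursion. The sketch below uses our own function names and shifts the paper's 1-based indices to Python's 0-based slicing; it is a direct transcription, not the faster scheme discussed in Section 3.

```python
from functools import lru_cache

def subseq_kernel(s, t, n, lam):
    """K_n(s, t) computed via the recursion of Definition 2."""
    @lru_cache(maxsize=None)
    def Kp(i, s, t):                    # K'_i(s, t)
        if i == 0:
            return 1.0
        if min(len(s), len(t)) < i:
            return 0.0
        x = s[-1]                       # write s as (prefix)x as in Definition 2
        total = lam * Kp(i, s[:-1], t)
        for j, c in enumerate(t):       # 0-based j; t[1:j-1] becomes t[:j]
            if c == x:                  # exponent |t| - j + 2 becomes len(t) - j + 1
                total += Kp(i - 1, s[:-1], t[:j]) * lam ** (len(t) - j + 1)
        return total

    @lru_cache(maxsize=None)
    def K(s, t):                        # K_n(s, t)
        if min(len(s), len(t)) < n:
            return 0.0
        x = s[-1]
        total = K(s[:-1], t)
        for j, c in enumerate(t):
            if c == x:
                total += Kp(n - 1, s[:-1], t[:j]) * lam ** 2
        return total

    return K(s, t)

print(subseq_kernel('car', 'cat', 2, 0.5))  # lam**4 = 0.0625, as in the example
```

The memoisation keeps the recursion from recomputing the same prefix pairs, but as written it still performs the inner sum over t at every step; the next section shows how that sum can be amortised away.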
Once we have created such a kernel it is natural to normalise it to remove any bias introduced by document length. We can produce this effect by normalising the feature vectors in the feature space. Hence, we create a new embedding φ̂(s) = φ(s)/||φ(s)||, which gives rise to the kernel \n\nK̂(s, t) = ⟨φ̂(s) · φ̂(t)⟩ = ⟨φ(s)/||φ(s)|| · φ(t)/||φ(t)||⟩ = (1/(||φ(s)|| ||φ(t)||)) ⟨φ(s) · φ(t)⟩ = K(s, t)/√(K(s, s) K(t, t)). \n\nThe normalised kernel introduced above was implemented using the recursive formulas described above. The next section gives some more details of the algorithmics, and this is followed by a section describing the results of applying the kernel in a Support Vector Machine for text classification. \n\n3 Algorithmics \n\nIn this section we describe how special design techniques provide a significant speed-up of the procedure, by both accelerating the kernel evaluations and reducing their number. \n\nWe used a simple gradient based implementation of SVMs (see [3]) with a fixed threshold. In order to deal with large datasets, we used a form of chunking: beginning with a very small subset of the data and gradually building up the size of the training set, while ensuring that only points which failed to meet margin 1 on the current hypothesis were included in the next chunk. \n\nSince each evaluation of the kernel function requires non-negligible computational resources, we designed the system so as to only calculate those entries of the kernel matrix that are actually required by the training algorithm. This can significantly reduce the training time, since only a relatively small part of the kernel matrix is actually used by our implementation of the SVM. \n\nSpecial care in the implementation of the kernel described in Definition 1 can significantly speed up its evaluation. 
As can be seen from the description of the recursion in Definition 2, its computation takes time proportional to n|s||t|^2, as the outermost recursion is over the sequence length and for each length and each additional character in s and t a sum over the sequence t must be evaluated. \n\nThe complexity of the computation can be reduced to O(n|s||t|) by first evaluating \n\nK''_i(sx, t) = Σ_{j: t_j = x} K'_{i-1}(s, t[1 : j - 1]) λ^{|t| - j + 2} \n\nand observing that we can then evaluate K'_i(s, t) with the O(|s||t|) recursion \n\nK'_i(sx, t) = λ K'_i(s, t) + K''_i(sx, t). \n\nNow observe that K''_i(sx, tu) = λ^{|u|} K''_i(sx, t), provided x does not occur in u, while \n\nK''_i(sx, tx) = λ (K''_i(sx, t) + λ K'_{i-1}(s, t)). \n\nThese observations together give an O(|s||t|) recursion for computing K''_i(s, t). Hence, we can evaluate the overall kernel in O(n|s||t|) time. \n\n4 Experimental Results \n\nOur aim was to test the efficacy of this new approach to feature extraction for text categorisation, and to compare it with a state-of-the-art system such as the one used in [6]. In particular, we wanted to see how the performance is affected by the tunable parameter k (we have used values 3, 5 and 6). As expected, using longer substrings in the comparison of two documents gives an improved performance. \n\nWe used the same dataset as that reported in [6], namely Reuters-21578 [7], as well as the Medline document collection of 1033 document abstracts from the National Library of Medicine. We performed all of our experiments on a subset of four categories: 'earn', 'acq', 'crude', and 'corn'. \n\nA confusion matrix can be used to summarise the performance of the classifier (number of true/false positives/negatives): \n\n     P   N \nP   TP  FP \nN   FN  TN \n\nWe define precision P = TP/(TP + FP) and recall R = TP/(TP + FN). We then define the quantity F1 = 2PR/(P + R) to measure the performance of the classifier. 
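For concreteness, these metric definitions amount to the following trivial helper of our own (note that F1 simplifies algebraically to 2TP/(2TP + FP + FN)):

```python
def f1_score(tp, fp, fn):
    """F1 = 2PR/(P+R), with precision P = TP/(TP+FP) and recall R = TP/(TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score(40, 5, 3))  # 2PR/(P+R) = 80/88, about 0.909
```

F1 is the harmonic mean of precision and recall, so a classifier must do well on both to score highly; the true-negative count does not enter.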
\n\nWe applied the two different kernels to a subset of Reuters of 380 training examples and 90 test examples. The only difference in the experiments was the kernel used. The splits of the data had the following numbers of positive examples in the training (test) sets: earn 152 (40); acq 114 (25); crude 76 (15); corn 38 (10). \n\nThe preliminary experiment used different values of k, in order to identify the optimal one, with the category 'earn'. The following experiments all used a sequence length of 5 for the string subsequence kernel. We set λ = 0.5. The results obtained are shown in the following tables, where the precision, recall and F1 values are shown for both kernels. \n\n        F1     Precision  Recall  # SV \n3 S-K   0.925  0.878      0.981   138 \n5 S-K   0.936  0.888      0.992   237 \n6 S-K   0.936  0.888      0.992   268 \nW-K     0.925  0.867      0.989   250 \n\nTable 1: F1, Precision, Recall and number of Support Vectors for the top Reuters category earn, averaged over 10 splits (n S-K = string kernel of length n, W-K = word kernel). \n\n        5 S-K kernel                    W-K kernel \n        F1     Precis.  Recall  # SV   F1     Precis.  Recall  # SV \nearn    0.936  0.888    0.992   237    0.925  0.867    0.989   250 \nacq     0.867  0.828    0.914   269    0.802  0.768    0.843   276 \ncrude   0.936  0.90     0.979   262    0.904  0.907    0.91    262 \ncorn    0.779  0.7      0.886   231    0.762  0.71     0.833   264 \n\nTable 2: Precision, Recall and F1 numbers for 4 categories for the two kernels: word kernel (W-K) and subsequence kernel (5 S-K). \n\nThe results are better in one category, and similar or slightly better for the other categories. They certainly indicate that the new kernel can outperform the more classical approach, but equally the performance is not reliably better. 
The last table shows the results obtained for two categories in the Medline data, queries 20 and 23. \n\nQuery  Train/Test  3 S-K (# SV)  5 S-K (# SV)  6 S-K (# SV)  W-K (# SV) \n#20    24/15       0.20 (101)    0.637 (295)   0.235 (598)   0.75 (386) \n#23    22/15       0.534 (107)   0.409 (302)   0.636 (618)   0.75 (382) \n\nTable 3: F1 and number of Support Vectors for the top two Medline queries. \n\n5 Conclusions \n\nThe paper has presented a novel kernel for text analysis, which relies on evaluating an inner product in a very high dimensional feature space, and tested it on a categorisation task. For a given sequence length k (k = 5 was used in the experiments reported) the features are indexed by all strings of length k. Direct computation of all the relevant features would be impractical even for moderately sized texts. The paper has presented a dynamic programming style computation for computing the kernel directly from the input sequences without explicitly calculating the feature vectors. \n\nFurther refinements of the algorithm have resulted in a practical alternative to the more standard word feature based kernel used in previous SVM applications to text classification [6]. We have presented an experimental comparison of the word feature kernel with our subsequence kernel on a benchmark dataset with encouraging results. The results reported here are very preliminary and many questions remain to be resolved. First, more extensive experiments are required to gain a more reliable picture of the performance of the new kernel, including the effect of varying the subsequence length and the parameter λ. The evaluation of the new kernel is still relatively time consuming, and more research is needed to investigate ways of expediting this phase of the computation. \n\nReferences \n\n[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. 
Automation and Remote Control, 25:821-837, 1964. \n\n[2] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992. \n\n[3] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000. www.support-vector.net. \n\n[4] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, University of California in Santa Cruz, Computer Science Department, July 1999. \n\n[5] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical Report 23, LS VIII, University of Dortmund, 1997. \n\n[6] T. Joachims. Text categorization with support vector machines. In Proceedings of the European Conference on Machine Learning (ECML), 1998. \n\n[7] D. Lewis. Reuters-21578 collection. Technical report, 1987. Available at: http://www.research.att.com/~lewis/reuters21578.html. \n\n[8] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. In Advances in Large Margin Classifiers, MIT Press, 2000. \n\n[9] J. Shawe-Taylor, P. Bartlett, R. Williamson and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 1998. \n\n[10] V. Vapnik. Statistical Learning Theory. Wiley, 1998. \n\n[11] C. Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, Computer Science Department, January 1999. 
\n\n\f", "award": [], "sourceid": 1869, "authors": [{"given_name": "Huma", "family_name": "Lodhi", "institution": null}, {"given_name": "John", "family_name": "Shawe-Taylor", "institution": null}, {"given_name": "Nello", "family_name": "Cristianini", "institution": null}, {"given_name": "Christopher", "family_name": "Watkins", "institution": null}]}*