{"title": "String Kernels, Fisher Kernels and Finite State Automata", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 656, "abstract": null, "full_text": "String Kernels, Fisher Kernels and Finite \n\nState Automata \n\nCraig Saunders \n\nJohn Shawe-Taylor \n\nAlexei Vinokourov \n\nDepartment of Computer Science \n\nRoyal Holloway, University of London \n\nEmail: { craig, j st, alexei }\u00ablcs. rhul. ac. uk \n\nAbstract \n\nIn this paper we show how the generation of documents can be \nthought of as a k-stage Markov process, which leads to a Fisher ker(cid:173)\nnel from which the n-gram and string kernels can be re-constructed. \nThe Fisher kernel view gives a more flexible insight into the string \nkernel and suggests how it can be parametrised in a way that re(cid:173)\nflects the statistics of the training corpus. Furthermore, the prob(cid:173)\nabilistic modelling approach suggests extending the Markov pro(cid:173)\ncess to consider sub-sequences of varying length, rather than the \nstandard fixed-length approach used in the string kernel. We give \na procedure for determining which sub-sequences are informative \nfeatures and hence generate a Finite State Machine model, which \ncan again be used to obtain a Fisher kernel. By adjusting the \nparametrisation we can also influence the weighting received by the \nfeatures . In this way we are able to obtain a logarithmic weighting \nin a Fisher kernel. Finally, experiments are reported comparing \nthe different kernels using the standard Bag of Words kernel as a \nbaseline. \n\n1 \n\nIntroduction \n\nRecently the string kernel [6] has been shown to achieve good performance on text(cid:173)\ncategorisation tasks. The string kernel projects documents into a feature space \nindexed by all k-tuples of symbols for some fixed k. The strength of the feature \nindexed by the k-tuple U = (Ul, ... 
, Uk) for a document d is the sum over all \noccurrences of U as a subsequence (not necessarily contiguous) in d, where each \noccurrence is weighted by an exponentially decaying function of its length in d. This \nnaturally extends the idea of an n-gram feature space where the only occurrences \nconsidered are contiguous ones. \n\nThe dimension of the feature space and the non-sparsity of even modestly sized \ndocuments makes a direct computation of the feature vector for the string kernel \ninfeasible. There is, however, a dynamic programming recursion that enables the \nsemi-efficient evaluation of the kernel [6]. String kernels are apparently making no \nuse of the semantic prior knowledge that the structure of words can give and yet \nthey have been used with considerable success. \n\n\fThe aim of this paper is to place the n-gram and string kernels in the context of \nprobabilistic modelling of sequences, showing that they can be viewed as Fisher ker(cid:173)\nnels of a Markov generation process. This immediately suggests ways of introducing \nweightings derived from refining the model based on the training corpus. \n\nFurthermore, this view also suggests extending consideration to subsequences of \nvarying lengths in the same model. This leads to a Finite State Automaton again \ninferred from the data. The refined probabilistic model that this affords gives rise \nto two Fisher kernels depending on the parametrisation that is chosen, if we take \nthe Fisher information matrix to be the identity. \n\nWe give experimental evidence suggesting that the new kernels are capturing useful \nproperties of the data while overcoming the computational difficulties of the original \nstring kernel. 
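To make the feature space just described concrete, the following minimal sketch (our own illustrative code, not from the paper) enumerates every k-element index vector explicitly and accumulates the decayed weights. It is exponential in document length and intended only for short strings; the function names are ours.

```python
from itertools import combinations
from collections import defaultdict

def string_kernel_features(doc, k, lam):
    """Naive feature map for the string kernel: for every k-tuple u, sum
    lam**l(i) over all (possibly non-contiguous) index vectors i with
    doc[i] == u, where l(i) = i_k - i_1 + 1 is the span of the occurrence."""
    phi = defaultdict(float)
    for idx in combinations(range(len(doc)), k):
        u = "".join(doc[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi

def string_kernel(d1, d2, k, lam):
    """k(d1, d2) = <phi(d1), phi(d2)> computed from the explicit feature maps."""
    p1 = string_kernel_features(d1, k, lam)
    p2 = string_kernel_features(d2, k, lam)
    return sum(v * p2[u] for u, v in p1.items() if u in p2)
```

For example, with k = 2 and lambda = 0.5, the document "cat" has features ca and at of weight 0.5^2 = 0.25 (contiguous, span 2) and ct of weight 0.5^3 = 0.125 (span 3).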
\n\n2 The Fisher VIew of the n-gram and String kernels \n\nIn this section we show how the string kernel can be thought of as a type of Fisher \nkernel [2] where the fixed-length subsequences used as the features in the string \nkernel correspond to the parameters for building the model. In order to give some \ninsight into the kernel we first give a Fisher formulat ion of the n-gram kernel (i.e. the \nstring kernel which considers only contiguous sequences), and then extend this to \nthe full string kernel. \n\nLet us assume that we have some document d of length s which is a sequence of \nsymbols belonging to some alphabet A, i.e. di E A, i = 1, ... , s. We can consider \ndocument d as being generated by a k-stage Markov process. According to this \nview, for sequences u E A k - l we can define the probability of observing a symbol x \nafter a sequence u as PU--+X. Sequences of k symbols therefore index the parameters \nof our model. The probability of a document d being generated by the model is \ntherefore \n\nIdl \n\nP(d) = II Pd[j-k+!:j-l]--+djl \n\nwhere we use the notation d[i: j] to denote the sequence didi +!\u00b7\u00b7 \u00b7dj . Now taking \nthe derivative of the log-probability: \n\nj = k \n\no In P( d) \n\no In TIj~k Pd[j -k+!:j -l]--+dj \n\nopu--+x \n\nIdl L olnpd[j-k+!:j-l]--+dj = tf(ux,d) \n\nj=k \n\nopu --+ x \n\nPu --+ x \n\n(1) \n\nwhere tf(ux,d) is the term frequency of ux in d, that is the number of times the \nstring ux occurs in d. l \n\n1 Since the pu-+x are not independent it is not possible to take the partial derivative of \n\none parameter without affecting others. However we can approximate our approach: \n\nWe introduce an extra character c. For each (n -\n\nI)-gram u we assign a sufficiently \nsmall probability to pu-+c and change the other pu-+x to pu-+x = pu-+x (1 - Pu-+c). We \nnow replace each occurence of Pu-+ c in P(d) by 1 - LaEA\\{ c }Pu-+ a . 
Thus, since uc never occurs in d and $\hat{p}_{u \to x} \approx p_{u \to x}$, the $u \to x$ Fisher score entry for a document d becomes

$$\frac{tf(ux, d)}{\hat{p}_{u \to x}} - \frac{tf(uc, d)}{p_{u \to c}} \approx \frac{tf(ux, d)}{p_{u \to x}}.$$

The Fisher kernel is subsequently defined to be

$$k(d, d') = U_d^\top I^{-1} U_{d'},$$

where $U_d$ is the Fisher score vector with ux-component $\frac{\partial \ln P(d)}{\partial p_{u \to x}}$ and $I = E_d[U_d U_d^\top]$. It has become traditional to set the matrix I to be the identity when defining a Fisher kernel, though this undermines the very satisfying property of the pure definition that it is independent of the parametrisation. We will follow this same route, mainly to reduce the complexity of the computation. We will, however, subsequently consider alternative parameterisations.

Different choices of the parameters $p_{u \to x}$ give rise to different models and hence different kernels. It is perhaps surprising that the n-gram kernel is recovered (up to a constant factor) if we set $p_{u \to x} = |A|^{-1}$ for all $u \in A^{n-1}$ and $x \in A$, that is the least informative parameter setting. This follows since the feature vector of a document d has entries

$$\phi_{ux}(d) = \frac{tf(ux, d)}{p_{u \to x}} = |A| \, tf(ux, d).$$

We therefore recover the n-gram kernel as the Fisher kernel of a model which uses a uniform distribution for generating documents.

Before considering how the $p_{u \to x}$ might be chosen non-uniformly we turn our attention briefly to the string kernel.

We have shown that we can view the n-gram kernel as a Fisher kernel. A little more work is needed in order to place the full string kernel (which considers non-contiguous subsequences) in the same framework.

First we define an index set $S_{k-1,q}$ over all (possibly non-contiguous) subsequences of length k which finish in position q:

$$S_{k-1,q} = \{\mathbf{i} : 1 \le i_1 < i_2 < \cdots < i_{k-1} < i_k = q\}.$$
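A quick sketch of this uniform-parameter recovery (our own illustrative code): with $p_{u \to x} = 1/|A|$ the Fisher score entry for ux is just $|A| \cdot tf(ux, d)$, so inner products of score vectors reproduce the n-gram kernel up to the constant factor $|A|^2$.

```python
from collections import Counter

def ngram_fisher_features(doc, n, alphabet_size):
    """Fisher score vector for the k-stage Markov model with uniform
    p_{u->x} = 1/|A|: the ux entry is tf(ux, d) / p_{u->x} = |A| * tf(ux, d)."""
    tf = Counter(doc[i:i + n] for i in range(len(doc) - n + 1))
    return {u: alphabet_size * c for u, c in tf.items()}

def fisher_kernel(d1, d2, n, alphabet_size):
    """k(d1, d2) = U_d1 . U_d2 with the Fisher information taken as identity;
    this equals the n-gram kernel up to the constant factor |A|^2."""
    u1 = ngram_fisher_features(d1, n, alphabet_size)
    u2 = ngram_fisher_features(d2, n, alphabet_size)
    return sum(v * u2.get(g, 0) for g, v in u1.items())
```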
\n\ni k -\n\nWe now define a probability distribution P Sk_1 ,q over Sk - l,q by weighting sequence \ni l + 1 is the length of i, and normalising with a \ni by )..l(i), where l(i) = \nfixed constant C . This may leave some probability unaccounted for, which can \nbe assigned to generating a spurious symbol. We denote by d [iJ the sequence of \ncharacters di1 di2 ... dik . We now define a text generation model that generates the \nsymbol for position q by first selecting a sequence i from Sk-l,q according to the \nfixed distribution P Sk _l,q and then generates the next symbol based on Pd[i'] -rdik for \nall possible values of dq where i' = (iI, i 2 , \u2022\u2022\u2022 , i k - l ) is the vector i without its last \ncomponent. We will refer to this model as the Generalised k-stage Markov model \nwith decay fa ctor )... Hence, if we assume that distributions are uniform \n\na In P ( d) \n\na In TIj~k I:iESk_l ,j P Sk-l,j (i)Pd[i']-rd ik \n\nf a In I:iEsk_l ,j P Sk -l ,j (i )Pd[i'] -rdik \n\napu-rx \n\nj = k \n\naPu-rx \n\nIdl \n\nIAI L L P Sk -1 ,j (i )Xux (d[i]) \n\nIdl \n\nIAIC- l L L )..l(i)Xux(d [i ]), \n\nj = k iESk_l ,j \n\n\fwhere Xux is the indicator function for string ux . It follows that the corresponding \nFisher features will be the weighted sum over all subsequences with decay factor A. \nIn other words we recover the string kernel. \n\nProposition 1 The Fisher kernel of the generalised k-stage Markov model with \ndecay fa ctor A and constant Pu--+x is th e string kernel of length k and decay fa ctor \nA. \n\n3 The Finite State Machine Model \n\nViewing the n-gram and string kernels as Fisher kernels of Markov models means \nwe can view the different sequences of k - 1 symbols as defining states with the \nnext symbol controlling the transition to the next state. We therefore arrive at a \nfinite state automaton with states indexed by A k - 1 and transitions labelled by the \nelements of A . 
Hence, if $u \in A^{k-1}$, the symbol $x \in A$ causes the transition to state $v[2:k]$, where $v = ux$.

One drawback of the string kernel is that the value of k has to be chosen a priori and is then fixed. A more flexible approach would be to consider different length subsequences as features, depending on their frequency. Subsequences that occur very frequently should be given a low weighting, as they do not contain much information, in the same way that stop words are often removed from the bag of words representation. Rather than downweight such sequences, an alternative strategy is to extend their length. Hence, the 3-gram com could be very frequent and hence not a useful discriminator. By extending it either backwards or forwards we would arrive at subsequences that are less frequent and so potentially carry useful information. Clearly, extending a sequence will always reduce its frequency, since the extension could have been made in many distinct ways, all of which contribute to the frequency of the root n-gram.

As this derivation follows more naturally from the analysis of the n-gram kernel described in Section 2, we will only consider contiguous subsequences, also known as substrings. We begin by introducing the general Finite State Machine (FSM) model and the corresponding Fisher kernel.

Definition 2 A Finite State Machine model over an alphabet A is a triple $F = (\Sigma, \delta, p)$ where

1. the non-empty set $\Sigma$ of states is a finite subset of $A^*$ closed under taking substrings,

2. the transition function $\delta : \Sigma \times A \to \Sigma$ is defined by

$$\delta(u, x) = v[j : l(v)], \text{ where } v = ux \text{ and } j = \min\{j : v[j : l(v)] \in \Sigma\},$$

if the minimum is defined, and otherwise the empty sequence $\epsilon$,

3. for each state u the function p gives a function $p_u$, which is either a distribution over next symbols $p_u(x)$ or the all-one function $p_u(x) = 1$, for $u \in \Sigma$ and $x \in A$.
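The transition function in Definition 2 simply selects the longest suffix of ux that is itself a state. A small sketch (our own code, with Python strings standing in for sequences over A):

```python
def delta(states, u, x):
    """Transition function of the FSM model: from state u on symbol x,
    move to the longest suffix of v = u + x that is itself a state.
    Since `states` is closed under taking substrings and non-empty,
    the empty string is always available as a fallback."""
    v = u + x
    for j in range(len(v)):  # j = 0 tries the whole of v, i.e. the longest suffix
        if v[j:] in states:
            return v[j:]
    return ""  # the empty sequence epsilon
```

For example, with states {epsilon, a, b, ab}, reading b from state a leads to state ab, while reading a from state ab leads to state a (since neither aba nor ba is a state).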
\n\nGiven an FSM model F = (~, J, p) to process a document d we start at the state \ncorresponding to the empty sequence f (guaranteed to be in ~ as it is non-empty and \nclosed under taking substrings) and follow the transitions dictated by the symbols \n\n\fof the document. The probability of a document in the model is the product of the \nvalues on all of the transitions used: \n\nIdl \n\nP.:F (d) = II Pd[id -1](dj ), \n\nj =l \n\nwhere ij = min{i: d[i: j -1] E ~}. Note that requiring that the set ~ to be closed \nunder taking substrings ensures that the minimum in the definition of is is always \ndefined and that d[ij \n: j] does indeed define the state at stage j (this follows from \na simple inductive argument on the sequence of states) . \n\nIf we follow a similar derivation to that given in equation (1) we arrive at the \ncorresponding feature for document d and transition on x from u of \n\ntf( (u, x), d) \n\n() \n\n\u00a2;u,x d = \n\n()' \n\nPu x \n\nwhere we use tf( (u, x), d) to denote the frequency of the transition on symbol x \nfrom a state u with non-unity Pu in document d. \n\nHence, given an FSM model we can construct the corresponding Fisher kernel fea(cid:173)\nture vector by simply processing the document through the FSM and recording the \ncounts for each transition. The corresponding feature vector will be sparse relative \nto the dimension of the feature space (the total number of transitions in the FSM) \nsince only those transitions actually used will have non-zero entries. Hence, as for \nthe bag of words we can create feature vectors by listing the indices of transitions \nused followed by their frequency. The number of non-zero features will be at most \nequal to the number of symbols in the document. \nConsider taking ~ = U7==-Ol Ai with all the distributions Pu uniform for u E A k - 1 \nand Pu == 1 for other u. In this case we recover the k-gram model and corresponding \nkernel. 
\n\nA problem that we have observed when experiment ing with the n-gram model is \nthat if we estimate the frequencies of transitions from the corpus certain transitions \ncan become very frequent while others from the same state occur only rarely. In \nsuch cases the rare states will receive a very high weighting in the Fisher score \nvector. One would like to use the strategy adopted for the idf weighting for the \nbag of words kernel which is often taken to be \n\nwhere m is the number of documents and m i the number containing term i. The \nIn ensures that the contrast in weighting is controlled. We can obtain this effect in \nthe Fisher kernel if we reparametrise the transition probabilities as follows \n\nPu(x) = exp(- exp( -tu(x))), \n\nwhere tu(x) is the new parameter. With this parametrisation the derivative of the \nIn probabilities becomes \n\na lnpu(x) \n\natu(x) \n\nexp(-tu(x )) = -lnpu(x), \n\nas required. \n\nAlthough this improves performance the problem of frequent substrings being un(cid:173)\ninformative remains. We now consider the idea outlined above of moving to longer \nsubsequences in order to ensure that transitions are informative. \n\n\f4 Choosing Features \n\nThere is a critical frequency at which the most information is conveyed by a feature. \nIf it is ubiquitous as we observed above it gives little or no information for analysing \ndocuments. If on the other hand it is very infrequent it again will not be useful \nsince we are only rarely able to use it. The usefulness is maximal at the threshold \nbetween these two extremes. Hence, we would like to create states that occur not \ntoo frequently and not too infrequently. \n\nA natural way to infer the set of such states is from the training corpus. We select \nall substrings that have occurred at least t t imes in the document corpus, where t \nis a small but statistically visible number. In our experiments we took t = 10. 
\nHence, given a corpus S we create the FSM model F t (S) with \n\nI;t (S) = {u E A* : u occurs at least t times in the corpus S} . \n\nTaking this definition of I;t (S) we construct the corresponding finite state machine \nmodel as described in Definition 2. We will refer to the model F t as the frequent \nset FSM at threshold t. \n\nWe now construct the transition probabilities by processing the corpus through \nthe Ft (S) keeping a tally of the number of times each transition is actually used. \nTypically we initialise the counts to some constant value c and convert the resulting \ncounts into probabilities for the model. Hence, if fu ,x is the number of times we \nleave state u processing symbol x, the corresponding probabilities will be \n\n( ) \n\nPu X = lAic + 2::x/EA fu ,x l \n\nfu,x + c \n\n(2) \n\nNote that we will usually exclude from the count the transitions at the beginning \nof a document d that start from states d[l : j] for some j ?: O. \nThe following proposition demonstrates that the model has the desired frequency \nproperties for the transitions. We use the notation u ~ v to indicate the transition \nfrom state u to state v on processing symbol x. \n\nProposition 3 Given a corpus S th e FSM model F t (S) satisfies th e following prop(cid:173)\nerty. Ign oring transitions from states indexed by d[l : i] for some docum ent d of th e \ncorpus, th e frequ ency counts f u,x for transitions u ~ v in th e corpus S satisfy \n\nfor all u E I;t (S) . \n\nProof. Suppose that for some state u E I;t (S) \n\n(3) \n\nThis implies that the string u has occurred at least tlAI times at the head of a \ntransition not at the beginning of a document. Hence, by the pigeon hole principle \nthere is ayE A such that y has occurred t times immediately before one of the \ntransitions in the sum of (3). Note that this also implies that yu occurs at least t \ntimes in the corpus and therefore will be in I;t (S). 
Consider one of the transitions that occurs after yu on some symbol x. This transition will not be of the form $u \xrightarrow{x} v$ but rather $yu \xrightarrow{x} v$, contradicting its inclusion in the sum (3). Hence, the proposition holds.

Note that the proposition implies that no individual transition can be more frequent than the full sum. The proposition also has useful consequences for the maximum weighting for any Fisher score entries, as the next corollary demonstrates.

Corollary 4 Given a corpus S, if we construct the FSM model $F_t(S)$ and compute the probabilities by counting transitions, ignoring those from states indexed by d[1 : i] for some document d of the corpus, the probabilities on the transitions will satisfy

$$p_u(x) \ge \frac{c}{|A|(c + t)}.$$

Proof. We substitute the bound given in the proposition into the formula (2).

The proposition and corollary demonstrate that the choice of $F_t(S)$ as an FSM model has the desirable property that all of the states are meaningfully frequent, while none of the transitions is too frequent, and furthermore the Fisher weighting cannot grow too large for any individual transition.

In the next section we will present experimental results testing the kernels we have introduced using the standard and logarithmic weightings. The baseline for the experiments will always be the bag of words kernel using the TFIDF weighting scheme. It is perhaps worth noting that though the IDF weighting appears similar to those described above, it makes critical use of the distribution of terms across documents, something that is incompatible with the Fisher approach that we have adopted. It is therefore very exciting to see the results that we are able to obtain using these syntactic features and sub-document level weightings.

5 Experimental Results

Our experiments were conducted on the top 10 categories of the standard Reuters-21578 data set using the \"Mod Apte\" split.
We compared the standard n-gram kernel with a uniform, non-uniform and ln weighting scheme, and the variable-length FSM model described in Section 4, both with uniform weighting and a ln weighting scheme. As mentioned in Section 4, the parameter t was set to 10. In order to keep the comparison fair, the n-gram kernel features were also pruned from the feature vector if they occurred fewer than 10 times. For our experiments we used 5-gram features, which have previously been reported to give the best results [5]. The standard bag of words model using the normal tfidf weighting scheme is used as a baseline. Once feature vectors had been created they were normalised, and the SVMlight software package [3] was used with the default parameter settings to obtain outputs for the test examples. In order to compare algorithms, we used the average precision measure commonly used in Information Retrieval (see e.g. [4]). This is the average of precision values obtained when thresholding at each positively classified document. If all positive documents in the corpus are ranked higher than any negative documents, then the average precision is 100%. Average precision incorporates both precision and recall measures and is highly sensitive to document ranking, and so can be used to obtain a fair comparison between methods. The results are shown in Table 1.

As can be seen from the table, the variable-length subsequence method performs as well as or better than all other methods and achieves a perfect ranking for documents in one of the categories.
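The average precision measure used above can be sketched as follows (our own code): scan the documents in decreasing score order and average the precision values observed at each positive document.

```python
def average_precision(scores, labels):
    """Average of precision values taken at each positive document when
    scanning in decreasing score order; returns 100.0 exactly when every
    positive document is ranked above every negative one."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return 100.0 * sum(precisions) / max(len(precisions), 1)
```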
\n\n\fMethod \nWeighting TFIDF Uniform \nearn \nacq \nmoney-fx \ngrain \ncrude \ntrade \ninterest \nship \nwheat \ncorn \n\n99.91 \n99.61 \n82.43 \n99.67 \n98.23 \n95.53 \n98.83 \n99.42 \n98.7 \n98.2 \n\nBoW \n\n99.86 \n99.62 \n80.54 \n99.69 \n98.52 \n95.29 \n91.61 \n96.84 \n98.52 \n98.95 \n\nngrams \n\nFSA \n\nIn 1;: Uniform \n99.9 \n99.5 \n83.4 \n99.4 \n97.2 \n95.6 \n95.4 \n98.9 \n99.3 \n99.0 \n\n99.9 \n99.7 \n86.5 \n97.8 \n100.0 \n94.6 \n94.0 \n92.7 \n95.3 \n97.5 \n\nIn 1;: \n99.9 \n99.7 \n85.8 \n97.5 \n100.0 \n91.3 \n88.8 \n98.4 \n98.4 \n98.1 \n\n1;: \n96.4 \n99.7 \n84.9 \n99.9 \n99.9 \n94.6 \n96.6 \n91.7 \n97.2 \n99.3 \n\nTable 1: Average precision results comparing TFIDF, n-gram and FSM features on \nthe top 10 categories of the reuters data set. \n\n6 Discussion \n\nIn this paper we have shown how the string kernel can be thought of as a k-stage \nMarkov process, and as a result interpreted as a Fisher kernel. Using this new \ninsight we have shown how the features of a Fisher kernel can be constructed using \na Finite State Model parameterisation which reflects the statistics of the frequency \nof occurance of features within the corpus. This model has then been extended \nfurther to incorporate sub-sequences of varying length, which is a great deal more \nflexible than the fixed-length approach. A procedure for determining informative \nsub-sequences (states in the FSM model) has also been given. Experimental results \nhave shown that this model outperforms the standard tfidf bag of words model on \na well known data set. Although the experiments in this paper are not extensive, \nthey show that the approach of using a Finite-State-Model to generate a Fisher \nkernel gives new insights and more flexibility over the string kernel, and performs \nwell. Future work would include determining the optimum value for the threshold \nt (maximum frequency of a sub-string occurring within the FSM before a state is \nexpanded) as this currently has to be set a-priori. 
\n\nReferences \n\n[1] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-\n\n99-10, University of California, Santa Cruz, July 1999. \n\n[2] T. Jaakkola, M. Diekhaus, and D. Haussler. Using the fisher kernel method to detect \n\nremote protein homologies. 7th Intell. Sys. Mol. Bio!. , pages 149- 158, 1999. \n\n[3] T. Joachims. Making large-scale svm learning practical. In B. Schiilkopf, C. Burges, \n\nand A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT(cid:173)\nPress, 1999. \n\n[4] Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola. The perceptron \n\nalgorithm with uneven margins. In Proceedings of the Nineteenth International Con(cid:173)\nference on Machine Learning (ICML '02), 2002. \n\n[5] H Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and Watkins C. Text clas(cid:173)\nsification using string kernels. Journal of Machine Learning Research, (2):419- 444, \n2002. \n\n[6] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using \nstring kernels. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in \nNeural Information Processing Systems 13, pages 563- 569. MIT Press, 2001. \n\n[7] C. Watkins. Dynamic alignment kernels. Technical Report CSD-TR-98-11, Royal \n\nHolloway, University of London, January 1999. \n\n\f", "award": [], "sourceid": 2327, "authors": [{"given_name": "Craig", "family_name": "Saunders", "institution": null}, {"given_name": "Alexei", "family_name": "Vinokourov", "institution": null}, {"given_name": "John", "family_name": "Shawe-taylor", "institution": null}]}