{"title": "Sequence and Tree Kernels with Statistical Feature Mining", "book": "Advances in Neural Information Processing Systems", "page_first": 1321, "page_last": 1328, "abstract": "", "full_text": "Sequence and Tree Kernels\n\nwith Statistical Feature Mining\n\nJun Suzuki and Hideki Isozaki\n\n2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto,619-0237 Japan\n\nNTT Communication Science Laboratories, NTT Corp.\n{jun, isozaki}@cslab.kecl.ntt.co.jp\n\nAbstract\n\nThis paper proposes a new approach to feature selection based on a sta-\ntistical feature mining technique for sequence and tree kernels. Since\nnatural language data take discrete structures, convolution kernels, such\nas sequence and tree kernels, are advantageous for both the concept and\naccuracy of many natural language processing tasks. However, experi-\nments have shown that the best results can only be achieved when lim-\nited small sub-structures are dealt with by these kernels. This paper dis-\ncusses this issue of convolution kernels and then proposes a statistical\nfeature selection that enable us to use larger sub-structures effectively.\nThe proposed method, in order to execute ef\ufb01ciently, can be embedded\ninto an original kernel calculation process by using sub-structure min-\ning algorithms. Experiments on real NLP tasks con\ufb01rm the problem in\nthe conventional method and compare the performance of a conventional\nmethod to that of the proposed method.\n\n1 Introduction\n\nSince natural language data take the form of sequences of words and are generally analyzed\ninto discrete structures, such as trees (parsed trees), discrete kernels, such as sequence\nkernels [7, 1] and tree kernels [2, 5], have been shown to offer excellent results in the\nnatural language processing (NLP) \ufb01eld. 
Conceptually, these proposed kernels are defined as instances of convolution kernels [3, 11], which provide the concept of kernels over discrete structures.

However, experiments have unfortunately shown that in some cases there is a critical issue with convolution kernels in NLP tasks [2, 1, 10]. That is, since natural language data contain many types of symbols, NLP tasks usually deal with an extremely high-dimensional and sparse feature space. As a result, the convolution kernel approach can never be trained effectively, and it behaves like a nearest neighbor rule. To avoid this issue, we generally eliminate large sub-structures from the set of features used. However, the main reason for using convolution kernels is that we aim to use structural features easily and efficiently. If their use is limited to only very small structures, this negates the advantages of using convolution kernels.

This paper discusses this issue of convolution kernels, in particular sequence and tree kernels, and proposes a new method based on a statistical significance test. The proposed method deals only with those features that are statistically significant for solving the target task, so large significant sub-structures can be used without over-fitting. Moreover, by using sub-structure mining algorithms, the proposed method can be executed efficiently by embedding it in the original kernel calculation process, which is defined by a dynamic-programming (DP) based calculation.

2 Convolution Kernels for Sequences and Trees

Convolution kernels have been proposed as a concept of kernels for discrete structures, such as sequences, trees and graphs. This framework defines the kernel function between input objects as the convolution of "sub-kernels", i.e. the kernels for the decompositions (parts or sub-structures) of the objects. Let X and Y be discrete objects. 
Conceptually, convolution kernels K(X, Y) enumerate all sub-structures occurring in X and Y and then calculate their inner product, which is simply written as: K(X, Y) = ⟨φ(X), φ(Y)⟩ = Σ_i φ_i(X) · φ_i(Y). φ represents the feature mapping from the discrete object to the feature space; that is, φ(X) = (φ_1(X), ..., φ_i(X), ...). Therefore, with sequence kernels, input objects X and Y are sequences, and φ_i(X) is a sub-sequence; with tree kernels, X and Y are trees, and φ_i(X) is a sub-tree. Up to now, many kinds of sequence and tree kernels have been proposed for a variety of different tasks. To clarify the discussion, this paper basically follows the framework of [1], which proposed a gapped word sequence kernel, and [5], which introduced a labeled ordered tree kernel.

A sequence can be treated as a special form of tree if we regard it as rooted at its last symbol, with each node having exactly one child, namely the previous symbol. Thus, in this paper, the word 'tree' always includes sequences. Let L be a finite set of symbols. Then, let L^n be the set of symbol strings whose size is n, and P(L^n) be the set of trees that are constructed from L^n. The meaning of "size" in this paper is the number of nodes in a tree. We denote a tree u ∈ P(L^n_1) whose size is n or less, where L^n_1 = ∪_{m=1}^{n} L^m. Let T be a tree and sub(T) be a function that returns the set of all possible sub-trees in T. We define a function C_u(t) that returns a constant λ (0 < λ ≤ 1) if the sub-tree t covers u with the same root symbol. For example, a sub-tree 'a-b-c-d', where 'a', 'b', 'c' and 'd' represent symbols and '-' represents an edge between symbols, covers the sub-trees 'd', 'a-c-d' and 'b-d'. That is, C_u(t) = λ if u matches t allowing the node skip, and 0 otherwise. 
We also define a function γ_u(t) that returns the difference in size between sub-trees t and u. For example, if t = a-b-c-d and u = a-b, then γ_u(t) = |4 − 2| = 2.

Formally, sequence and tree kernels can be defined in the same form as

K_{SK,TK}(T^1, T^2) = Σ_{u ∈ P(L^n_1)} ( Σ_{t^1 ∈ sub(T^1)} C_u(t^1)^{γ_u(t^1)} ) ( Σ_{t^2 ∈ sub(T^2)} C_u(t^2)^{γ_u(t^2)} ).   (1)

Note that this formula also includes the node-skip framework that is generally introduced only in sequence kernels [7, 1]; λ is the decay factor that handles the gaps present in sub-trees u and t.

Sequence and tree kernels are defined by recursive formulas so that they can be calculated efficiently instead of by the explicit calculation of Equation (1). Moreover, when implemented, these kernels can be calculated in O(n|T^1||T^2|), where |T| represents the number of nodes in T, by using the DP technique. Note that if the kernel does not use the size restriction, the calculation cost becomes O(|T^1||T^2|).

3 Problem of Applying Convolution Kernels to Real Tasks

According to the original definition of convolution kernels, all of the sub-structures are enumerated and calculated for the kernels. The number of sub-structures in an input object usually grows exponentially with the object's size. The number of symbols, |L|, is generally very large (i.e. more than 10,000), since words are treated as symbols. Moreover, the appearance of sub-structures (sub-sequences and sub-trees) is highly correlated with that of their own sub-structures. As a result, the dimension of the feature space becomes extremely high, and all kernel values K(X, Y) are very small compared to the kernel value of the object itself, K(X, X). 
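To make the contrast between explicit enumeration and the DP calculation concrete, the following sketch specializes to sequences and uses the gap-weighted subsequence kernel of [7] (decay λ per spanned position, length-n subsequences only). This is an illustrative assumption that differs in its decay details from the exact definition above, not the paper's own recursion; both functions compute the same kernel, one in exponential time and one in O(n|s||t|).

```python
from itertools import combinations
from collections import defaultdict

def naive_kernel(s, t, n, lam):
    """Explicit enumeration in the spirit of Equation (1): every common
    gapped subsequence u of length n contributes lam**(span in s) *
    lam**(span in t), where span is the window the occurrence stretches
    over.  Exponential in |s| and |t|."""
    def phi(seq):
        feats = defaultdict(float)
        for idx in combinations(range(len(seq)), n):
            u = tuple(seq[i] for i in idx)
            feats[u] += lam ** (idx[-1] - idx[0] + 1)  # decay over the span
        return feats
    fs, ft = phi(s), phi(t)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

def dp_kernel(s, t, n, lam):
    """The same kernel in O(n|s||t|) by dynamic programming [7]."""
    m, p = len(s), len(t)
    # Kp[a][b] = K'_i(s[:a], t[:b]); K'_0 = 1 everywhere
    Kp = [[1.0] * (p + 1) for _ in range(m + 1)]
    for i in range(1, n):
        Kp_new = [[0.0] * (p + 1) for _ in range(m + 1)]
        for a in range(1, m + 1):
            Kpp = 0.0  # K''_i accumulated along b for fixed a
            for b in range(1, p + 1):
                Kpp = lam * Kpp + (lam * lam * Kp[a - 1][b - 1]
                                   if s[a - 1] == t[b - 1] else 0.0)
                Kp_new[a][b] = lam * Kp_new[a - 1][b] + Kpp
        Kp = Kp_new
    # close a length-n match at every pair of equal symbols
    return sum(lam * lam * Kp[a - 1][b - 1]
               for a in range(1, m + 1) for b in range(1, p + 1)
               if s[a - 1] == t[b - 1])
```

Even for modest n, `naive_kernel` enumerates a combinatorial number of index tuples, which is exactly the blow-up discussed in this section, while `dp_kernel` stays polynomial.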
In this situation, the convolution kernel approach can never be trained effectively, and it will behave like a nearest neighbor rule; we obtain a result that is very precise but has very low recall. The details of this issue are described in [2].

To avoid this, most conventional methods take an approach that involves smoothing the kernel values or eliminating features based on sub-structure size. For sequence kernels, [1] uses a feature elimination method based on the size n of the sub-sequences. This means that the kernel calculation deals only with those sub-sequences whose length is n or less. Similarly, for tree kernels, [2] proposed a method that restricts the features based on sub-tree depth. These methods seem to work well on the surface; however, good results can only be achieved when n is very small, i.e. n = 2 or 3. For example, n = 3 showed the best performance for parsing in the experimental results of [2], and n = 2 showed the best performance for the text classification task in [1]. The main reason for using these kernels is that they allow us to employ structural features simply and efficiently. When only small-sized sub-structures are used (i.e. n = 2 or 3), the full benefits of the kernels are missed.

Moreover, these results do not mean that larger sub-structures are useless. In some cases we already know that certain larger sub-structures can be significant features for solving the target problem. That is, significant larger sub-structures, which the conventional methods cannot deal with efficiently, have the potential to further improve performance. 
The aim of the work described in this paper is to make it possible to use any significant sub-structure efficiently, regardless of its size, to better solve NLP tasks.

4 Statistical Feature Mining Method for Sequence and Tree Kernels

This section proposes a new approach to feature selection, which is based on a statistical significance test, in contrast to the conventional methods, which use sub-structure size.

To simplify the discussion, we restrict ourselves hereafter to the two-class (positive and negative) supervised classification problem. In our approach, we test the statistical deviation of all sub-structures in the training samples between their appearance in positive samples and negative samples, and then select as features only those sub-structures whose deviation is larger than a certain threshold τ. This allows us to select only the statistically significant sub-structures. In this paper, we explain our proposed method using the chi-squared (χ²) value as the statistical metric. We note, however, that many types of statistical metrics can be used in our method.

Table 1: Contingency table and notation for the chi-squared value

             c       c̄       Σ row
  u         O_uc    O_uc̄     O_u
  ū         O_ūc    O_ūc̄     O_ū
  Σ column  O_c     O_c̄      N

First, we briefly explain how to calculate the χ² value by referring to Table 1. c and c̄ represent the names of classes, c for the positive class and c̄ for the negative class. O_ij, where i ∈ {u, ū} and j ∈ {c, c̄}, represents the number of samples in each case. O_uc̄, for instance, represents the number of samples in which u appeared in class c̄. Let N be the total number of training samples. 
Since N and O_c are constant for the training samples, χ² can be obtained as a function of O_u and O_uc. The χ² value expresses the normalized deviation of the observation from the expectation: chi(O_u, O_uc) = Σ_{i∈{u,ū}, j∈{c,c̄}} (O_ij − E_ij)² / E_ij, where E_ij = O_i · O_j / N represents the expectation. We simply write chi(O_u, O_uc) as χ²(u).

In the kernel calculation with statistical feature selection, if χ²(u) < τ holds, that is, u is not statistically significant, then u is eliminated from the features, and the value of u is presumed to be 0 in the kernel value. Therefore, the sequence and tree kernels with feature selection (SK,TK+FS) can be defined as follows:

K_{SK,TK+FS}(T^1, T^2) = Σ_{u ∈ {u | τ ≤ χ²(u), u ∈ P(L^n_1)}} ( Σ_{t^1 ∈ sub(T^1)} C_u(t^1)^{γ_u(t^1)} ) ( Σ_{t^2 ∈ sub(T^2)} C_u(t^2)^{γ_u(t^2)} ).   (2)

The only difference from the original kernels is the condition on the first summation, τ ≤ χ²(u).

The basic idea of using a statistical metric to select features is quite natural and not in itself novel. We note, however, that it is not clear how to calculate these kernels efficiently with statistical feature selection. It is computationally infeasible to calculate χ²(u) for all possible u with a naive exhaustive method. In our approach, we take advantage of sub-structure mining algorithms to calculate χ²(u) efficiently and to embed the statistical feature selection in the kernel calculation. Formally, sub-structure mining finds the complete set, without duplication, of all significant (generally frequent) sub-structures in a dataset. Specifically, we apply a combination of a sequential pattern mining technique, PrefixSpan [9], and a statistical metric pruning (SMP) method, Apriori SMP [8]. 
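The χ² calculation described above is a function of (O_u, O_uc) alone once N and O_c are fixed, which the following minimal sketch makes explicit (the variable names are ours):

```python
def chi(o_u, o_uc, o_c, n):
    """chi^2 value of sub-structure u against class c, computed from the
    2x2 contingency table of Table 1.  Since N (total samples) and O_c
    (positive samples) are fixed, (o_u, o_uc) determine the whole table."""
    cells = [
        (o_uc,                 o_u,     o_c),      # (u, c)
        (o_u - o_uc,           o_u,     n - o_c),  # (u, c-bar)
        (o_c - o_uc,           n - o_u, o_c),      # (u-bar, c)
        (n - o_u - o_c + o_uc, n - o_u, n - o_c),  # (u-bar, c-bar)
    ]
    chi2 = 0.0
    for o_ij, o_i, o_j in cells:
        e_ij = o_i * o_j / n  # expectation E_ij = O_i * O_j / N
        if e_ij > 0:          # an empty margin contributes nothing
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2
```

For example, with N = 100 balanced samples, a sub-structure that occurs in all 50 positive samples and nowhere else reaches the maximal value χ² = 100, while one split evenly across classes gets χ² = 0.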
PrefixSpan can substantially reduce the search space when enumerating all significant sub-sequences. Briefly, it finds every sub-sequence uw of size n by searching for a single symbol w in the projected database of the sub-sequence (prefix) u of size n − 1. The projected database is a partial database that contains only the postfixes (pointers, in the implementation) following appearances of the prefix u in the database. Search starts from n = 1; that is, PrefixSpan enumerates all the significant sub-sequences by the recursive calculation of pattern-growth (searching the projected database of prefix u and adding a symbol w to u) and prefix-projection (building the projected database of uw).

Before explaining the algorithm of the proposed kernels, we introduce the upper bound of the χ² value. The upper bound of the χ² value of a sequence uv, which is the concatenation of sequences u and v, can be calculated from the contingency table of the prefix u [8]: χ²(uv) ≤ χ̂²(u) = max(chi(O_uc, O_uc), chi(O_u − O_uc, 0)). This upper bound indicates that if χ̂²(u) < τ holds, no (super-)sequence uv whose prefix is u can exceed the threshold, i.e. τ ≤ χ²(uv) cannot hold. In this case, we can eliminate all (super-)sequences uv from the feature candidates without evaluating uv explicitly.

Using this property in the PrefixSpan algorithm, we can avoid evaluating all the (super-)sequences uv by evaluating the upper bound of the sequence u. After counting the occurrences of each individual symbol w in the projected database of u, we evaluate uw under the following three conditions: (1) τ ≤ χ²(uw); (2) τ > χ²(uw) and τ > χ̂²(uw); (3) τ > χ²(uw) and τ ≤ χ̂²(uw). With condition (1), sub-sequence uw is selected as a feature. 
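The upper bound and the resulting three-way decision for conditions (1)-(3) can be sketched as follows. This is a sketch under our own naming; the outcomes attached to conditions (2) and (3) follow the standard SMP reading (stop growing vs. keep growing without selecting):

```python
def chi(o_u, o_uc, o_c, n):
    # chi^2 from the 2x2 contingency table of Table 1 (empty cells skipped)
    chi2 = 0.0
    for o_ij, o_i, o_j in [(o_uc, o_u, o_c), (o_u - o_uc, o_u, n - o_c),
                           (o_c - o_uc, n - o_u, o_c),
                           (n - o_u - o_c + o_uc, n - o_u, n - o_c)]:
        e_ij = o_i * o_j / n
        if e_ij > 0:
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2

def chi_upper_bound(o_u, o_uc, o_c, n):
    """chi^2_hat(u) of [8]: the best chi^2 any supersequence uv can reach,
    attained at the extremes O_uv = O_uvc = O_uc and O_uv = O_u - O_uc,
    O_uvc = 0."""
    return max(chi(o_uc, o_uc, o_c, n), chi(o_u - o_uc, 0, o_c, n))

def grow_decision(o_u, o_uc, o_c, n, tau):
    """Three-way decision for a candidate uw during pattern-growth."""
    x, x_hat = chi(o_u, o_uc, o_c, n), chi_upper_bound(o_u, o_uc, o_c, n)
    if tau <= x:
        return 'select'    # condition (1): uw becomes a feature
    if tau > x_hat:
        return 'prune'     # condition (2): no uwv can reach tau
    return 'continue'      # condition (3): uw rejected, uwv still possible
```

For instance, a candidate with O_u = 20, O_uc = 15 (N = 100, O_c = 50) has χ² = 6.25 but an upper bound of about 17.6, so it is selected at τ = 5, grown without being selected at τ = 10, and pruned outright at τ = 20.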
With condition (2), uw is pruned; that is, all uwv are also pruned from the search space. With condition (3), uw is not significant, but uwv can be; thus uw is not selected as a feature, but mining continues to uwv. Figure 1 shows an example of searching and pruning sub-sequences to select significant features with the PrefixSpan with SMP algorithm.

Figure 1: Example of searching and pruning the sub-sequences by PrefixSpan with SMP algorithm

Figure 2: Example of the string encoding for trees under the postorder traversal

The well-known tree mining algorithm [12] cannot simply be applied as a feature selection method for the proposed tree kernels, because it executes a preorder search of trees, while tree kernels are calculated in postorder. Thus, we take advantage of a string (sequence) encoding method for trees and handle trees with sequence kernels. Figure 2 shows an example of the string encoding for trees under the postorder traversal. The brackets indicate the hierarchical relation between the nodes on their left and right hand sides. We treat these brackets as special symbols during the sequential pattern mining phase. Sub-trees are evaluated as the same if and only if their string-encoded sub-sequences are exactly the same, including brackets. For example, 'd ) b ) a' and 'd b ) a' are different.

We previously said that a sequence can be treated as a tree. We encode sequences in the same way; for example, the sequence 'a b c d' is encoded as '((((a) b) c) d)'. 
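The encoding just described (children first, then the node label, with brackets marking the hierarchy) can be sketched as follows; representing a tree as a (label, children) tuple is our own assumption for illustration:

```python
def encode_postorder(tree):
    """String-encode a tree in postorder, in the spirit of Figure 2:
    children are emitted before the node label, and brackets mark the
    hierarchy.  A tree is a (label, [children]) pair."""
    label, children = tree
    inner = [encode_postorder(c) for c in children]
    return '(' + ' '.join(inner + [label]) + ')'

def encode_sequence(seq):
    """A sequence is a chain rooted at its last symbol, so 'a b c d'
    becomes '((((a) b) c) d)' as in the text."""
    tree = (seq[0], [])
    for sym in seq[1:]:
        tree = (sym, [tree])
    return encode_postorder(tree)
```

During mining, the brackets in the resulting strings are treated as ordinary symbols, so two encoded sub-sequences match only if their brackets match as well.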
That is, we can define sequence and tree kernels with our feature selection method in the same form.

Sequence and Tree Kernels with Statistical Feature Mining: Sequence and tree kernels with our proposed feature selection method are defined by the following equations.

K_{SK,TK+FS}(T^1, T^2; D) = Σ_{1≤i≤|T^1|} Σ_{1≤j≤|T^2|} H_n(T^1_i, T^2_j; D)   (3)

D represents the training data, and i and j represent indices of nodes in postorder of T^1 and T^2, respectively. Let H_n(T^1_i, T^2_j; D) be a function that returns the sum of the values of all statistically significant common sub-sequences u with |u| ≤ n if t^1_i = t^2_j:

H_n(T^1_i, T^2_j; D) = Σ_{u ∈ Γ_n(T^1_i, T^2_j; D)} J_u(T^1_i, T^2_j; D),   (4)

where Γ_n(T^1_i, T^2_j; D) represents the set of sub-sequences u with |u| ≤ n that satisfy condition (1) above. Then, let J_u(T^1_i, T^2_j; D), J'_u(T^1_i, T^2_j; D) and J''_u(T^1_i, T^2_j; D) be functions that calculate the value of the common sub-sequences between T^1_i and T^2_j recursively:

J_{uw}(T^1_i, T^2_j) = { J'_u(T^1_i, T^2_j; D) · I_w(t^1_i, t^2_j)   if uw ∈ Γ̂_n(T^1_i, T^2_j; D),
                       { 0                                           otherwise.   (5)