{"title": "A New Discriminative Kernel From Probabilistic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 984, "abstract": null, "full_text": "A New Discriminative Kernel From Probabilistic Models \n\nK. Tsuda,*† M. Kawanabe,* G. Rätsch,§* S. Sonnenburg,* and K.-R. Müller*+ \n\n† AIST CBRC, 2-41-6, Aomi, Koto-ku, Tokyo, 135-0064, Japan \n\n* Fraunhofer FIRST, Kekulestr. 7, 12489 Berlin, Germany \n\n§ Australian National University, \n\nResearch School for Information Sciences and Engineering, \n\nCanberra, ACT 0200, Australia \n\n+ University of Potsdam, Am Neuen Palais 10, 14469 Potsdam, Germany \n\nkoji.tsuda@aist.go.jp, nabe@first.fraunhofer.de, \n\nGunnar.Raetsch@anu.edu.au, {sonne, klaus}@first.fraunhofer.de \n\nAbstract \n\nRecently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so-called \"Fisher kernel\" has been combined with discriminative classifiers such as the SVM and applied successfully in, e.g., DNA and protein analysis. Whereas the Fisher kernel (FK) is calculated from the marginal log-likelihood, we propose the TOP kernel derived from Tangent vectors Of Posterior log-odds. Furthermore, we develop a theoretical framework on feature extractors from probabilistic models and use it for analyzing FK and TOP. In experiments, our new discriminative TOP kernel compares favorably to the Fisher kernel. \n\n1 Introduction \n\nIn classification tasks, learning enables us to predict the output y ∈ {-1, +1} of some unknown system given the input x ∈ X, based on the training examples {x_i, y_i}_{i=1}^n. The purpose of a feature extractor f : X → R^D is to convert the representation of the data without losing the information needed for classification [3]. When X is a vector space like R^d, a variety of feature extractors have been proposed (e.g. Chapter 10 in [3]). 
However, they are typically not applicable when X is a set of sequences of symbols and does not have the structure of a vector space, as in DNA or protein analysis [2]. \n\nRecently, the Fisher kernel (FK) [6] was proposed to compute features from a probabilistic model p(x, y|θ). At first, the parameter estimate θ̂ is obtained from the training examples. Then, the tangent vector of the log marginal likelihood log p(x|θ̂) is used as a feature vector. The Fisher kernel refers to the inner product in this feature space, but the method is effectively a feature extractor (also since the features are computed explicitly). The Fisher kernel has been combined with discriminative classifiers such as the SVM and achieved excellent classification results in several fields, for example in DNA and protein analysis [6, 5]. Empirically, it is reported that the FK-SVM system often outperforms the classification performance of the plug-in estimate.¹ Note that the Fisher kernel is only one possible member of the family of feature extractors f_θ̂ : X → R^D that can be derived from probabilistic models. We call this family \"model-dependent feature extractors\". Exploring this family is a very important and interesting subject. \n\nSince model-dependent feature extractors have been newly developed, performance measures for them are not yet established. We therefore first propose two performance measures. Then, we define a new kernel (or, equivalently, a feature extractor) derived from the Tangent vector Of the Posterior log-odds, which we denote as the TOP kernel. We analyze the performance of the TOP kernel and the Fisher kernel in terms of our performance measures. The TOP kernel then compares favorably to the Fisher kernel in a protein classification experiment. \n\n2 Performance Measures \n\nTo begin with, let us describe the notation. 
Let x ∈ X be the input 'point' and y ∈ {-1, +1} be the class label. X may be a finite set or an infinite set like R^d. Let us assume that we know the parametric model of the joint probability p(x, y|θ), where θ ∈ R^p is the parameter vector. Assume that the model p(x, y|θ) is regular [7] and contains the true distribution. Then the true parameter θ* is uniquely determined. Let θ̂ be a consistent estimator [1] of θ*, obtained from n training examples drawn i.i.d. from p(x, y|θ*). Let ∂_{θ_i} f = ∂f/∂θ_i and ∇_θ f = (∂_{θ_1} f, ..., ∂_{θ_p} f)^T, and let ∇²_θ f denote the p × p matrix whose (i,j)-th element is ∂²f/(∂θ_i ∂θ_j). \n\nAs the Fisher kernel is commonly used in combination with linear classifiers such as SVMs, one reasonable performance measure is the classification error of a linear classifier w^T f_θ̂(x) + b (w ∈ R^D and b ∈ R) in the feature space. Usually w and b are learned, so the optimal feature extractor is different with regard to each learning algorithm. To cancel out this ambiguity and to make a theoretical analysis possible, we assume the optimal learning algorithm is used. When w and b are optimally chosen, the classification error is \n\nR(f_θ̂) = min_{w ∈ S, b ∈ R} E_{x,y} Ψ[-y(w^T f_θ̂(x) + b)],   (2.1) \n\nwhere S = {w ∈ R^D | ||w|| = 1}, Ψ[a] is the step function which is 1 when a > 0 and 0 otherwise, and E_{x,y} denotes the expectation with respect to the true distribution p(x, y|θ*). R(f_θ̂) is at least as large as the Bayes error L* [3], and R(f_θ̂) = L* only if the linear classifier implements the same decision rule as the Bayes optimal rule. \n\nAs a related measure, we consider the estimation error of the posterior probability by a logistic regressor F(w^T f_θ̂(x) + b), with e.g. F(t) = 1/(1 + exp(-t)): \n\nD(f_θ̂) = min_{w ∈ R^D, b ∈ R} E_x |F(w^T f_θ̂(x) + b) - P(y = +1|x, θ*)|.   (2.2) \n\nThe relationship between D(f_θ̂) and R(f_θ̂) is illustrated as follows: let L be the classification error rate of a posterior probability estimator P̂(y = +1|x). With regard to L, the following inequality is known [1]: \n\nL - L* ≤ 2 E_x |P̂(y = +1|x) - P(y = +1|x, θ*)|.   (2.3) \n\nWhen P̂(y = +1|x) := F(w^T f_θ̂(x) + b), this inequality leads to the following relationship between the two measures: \n\nR(f_θ̂) - L* ≤ 2 D(f_θ̂).   (2.4) \n\n¹ In classification by the plug-in estimate, x is classified by thresholding the posterior probability: ŷ = sign(P(y = +1|x, θ̂) - 0.5) [1]. \n\nSince 2D(f_θ̂) is an upper bound on the excess error R(f_θ̂) - L*, it is useful to derive a new kernel that minimizes D(f_θ̂), as will be done in Sec. 4. \n\n3 The Fisher kernel \n\nThe Fisher kernel (FK) is defined² as K(x, x') = s(x, θ̂)^T Z^{-1}(θ̂) s(x', θ̂), where s is the Fisher score \n\ns(x, θ̂) = (∂_{θ_1} log p(x|θ̂), ..., ∂_{θ_p} log p(x|θ̂))^T = ∇_θ log p(x|θ̂), \n\nand Z is the Fisher information matrix: Z(θ) = E_x[s(x, θ) s(x, θ)^T | θ]. The theoretical foundation of the FK is described in the following theorem [6]: \"a kernel classifier employing the Fisher kernel derived from a model that contains the label as a latent variable is, asymptotically, at least as good a classifier as the MAP labeling based on the model\". The theorem says that the Fisher kernel can perform at least as well as the plug-in estimate, if the parameters of the linear classifier are properly determined (cf. Appendix A of [6]). With our performance measure, this theorem can be stated more concisely: R(f_θ̂) is bounded by the classification error of the plug-in estimate, \n\nR(f_θ̂) ≤ E_{x,y} Ψ[-y(P(y = +1|x, θ̂) - 0.5)].   (3.1) \n\nNote that the classification rule constructed by the plug-in estimate P(y = +1|x, θ̂) can also be realized by a linear classifier in feature space. 
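The Fisher kernel above can be made concrete on a toy model. The sketch below is a minimal illustration (not the paper's protein setup): it assumes a one-dimensional two-component Gaussian mixture marginal p(x|θ) = α N(x; μ₁, 1) + (1-α) N(x; μ₂, 1), computes the Fisher score s(x, θ̂) analytically, and estimates the Fisher information Z empirically from a sample. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

def gauss(x, m):
    # N(x; m, 1) density
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

def fisher_score(x, a, m1, m2):
    """Gradient of log p(x|theta) w.r.t. theta = (a, m1, m2)."""
    p1, p2 = gauss(x, m1), gauss(x, m2)
    p = a * p1 + (1 - a) * p2
    return np.array([
        (p1 - p2) / p,                 # d log p / da
        a * p1 * (x - m1) / p,         # d log p / dm1
        (1 - a) * p2 * (x - m2) / p,   # d log p / dm2
    ])

def fisher_kernel(x, xp, theta, sample):
    """K(x, x') = s(x)^T Z^{-1} s(x'), with Z estimated from `sample`."""
    S = np.stack([fisher_score(z, *theta) for z in sample])
    Z = S.T @ S / len(sample)          # empirical Fisher information
    Zinv = np.linalg.inv(Z + 1e-8 * np.eye(3))
    return fisher_score(x, *theta) @ Zinv @ fisher_score(xp, *theta)

# Illustrative parameters and a sample drawn from the model itself.
theta = (0.5, -1.0, 1.0)
rng = np.random.default_rng(0)
comp = rng.random(500) < 0.5
sample = np.where(comp, rng.normal(-1.0, 1.0, 500), rng.normal(1.0, 1.0, 500))
print(fisher_kernel(0.3, -0.7, theta, sample))
```

Since Z^{-1} is symmetric, the resulting kernel is symmetric in its arguments, as a kernel must be.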
Property (3.1) is important since it guarantees that the Fisher kernel performs at least as well as the plug-in estimate when the optimal w and b are available. However, the Fisher kernel is not the only kernel to satisfy this inequality. In the next section, we present a new kernel which satisfies (3.1) and has a more appealing theoretical property as well. \n\n4 The TOP Kernel \n\nDefinition. Now we proceed to propose a new kernel. Our aim is to obtain a feature extractor that achieves small D(f_θ̂). When a feature extractor f_θ̂(x) satisfies³ \n\nw^T f_θ̂(x) + b = F^{-1}(P(y = +1|x, θ*)) for all x ∈ X   (4.1) \n\nwith certain values of w and b, we have D(f_θ̂) = 0. However, since the true parameter θ* is unknown, all we can do is construct an f_θ̂ which approximately satisfies (4.1). Let us define \n\nv(x, θ) = F^{-1}(P(y = +1|x, θ)) = log P(y = +1|x, θ) - log P(y = -1|x, θ), \n\nwhich is called the posterior log-odds of a probabilistic model [1]. By Taylor expansion around the estimate θ̂ up to the first order⁴, we can approximate v(x, θ*) as \n\nv(x, θ*) ≈ v(x, θ̂) + Σ_{i=1}^{p} ∂_{θ_i} v(x, θ̂)(θ*_i - θ̂_i).   (4.2) \n\n² In practice, some variants of the Fisher kernel are used. For example, if the derivative of each class distribution, not of the marginal, is taken, the feature vector of the FK is quite similar to that of our kernel. However, these variants should be deliberately discriminated from the Fisher kernel in theoretical discussions. Throughout this paper, including the experiments, we adopt the original definition of the Fisher kernel from [6]. \n\n³ Notice that F^{-1}(t) = log t - log(1 - t). \n\n⁴ One can easily derive TOP kernels from higher order Taylor expansions. However, we will only deal with the first order expansion here, because higher order expansions would induce extremely high dimensional feature vectors in practical cases. \n\nThus, by setting \n\nf_θ̂(x) = (v(x, θ̂), ∂_{θ_1} v(x, θ̂), ..., ∂_{θ_p} v(x, θ̂))^T   (4.3) \n\nand \n\nw := w* = (1, θ*_1 - θ̂_1, ..., θ*_p - θ̂_p)^T,  b = 0,   (4.4) \n\nequation (4.1) is approximately satisfied. Since a Tangent vector Of the Posterior log-odds constitutes the main part of the feature vector, we call the inner product of two such feature vectors the \"TOP kernel\": \n\nK(x, x') = f_θ̂(x)^T f_θ̂(x').   (4.5) \n\nIt is easy to verify that the TOP kernel satisfies (3.1), because we can construct the same decision rule as the plug-in estimate by using the first element only (i.e. w = (1, 0, ..., 0), b = 0). \n\nA Theoretical Analysis. In this section, we compare the TOP kernel with the plug-in estimate in terms of the performance measures. In the following, we assume that 0 < P(y = +1|x, θ) < 1 to prevent |v(x, θ)| from going to infinity. Also, it is assumed that ∇_θ P(y = +1|x, θ) and ∇²_θ P(y = +1|x, θ) are bounded. Substituting the plug-in estimate, denoted by the subscript π, into D, we have \n\nD_π(θ̂) = E_x |P(y = +1|x, θ̂) - P(y = +1|x, θ*)|.   (4.6) \n\nDefine Δθ = θ̂ - θ*. By Taylor expansion around θ*, we have \n\nD_π(θ̂) = E_x |Δθ^T ∇_θ P(y = +1|x, θ_0)| = O(||Δθ||), \n\nwhere θ_0 = θ* + γΔθ (0 ≤ γ ≤ 1). When the TOP kernel is used, \n\nD(f_θ̂) ≤ E_x |F((w*)^T f_θ̂(x)) - P(y = +1|x, θ*)|,   (4.7) \n\nwhere w* is defined as in (4.4). Since F is Lipschitz-continuous, there is a finite positive constant M such that |F(a) - F(b)| ≤ M|a - b|. Thus, \n\nD(f_θ̂) ≤ M E_x |(w*)^T f_θ̂(x) - F^{-1}(P(y = +1|x, θ*))|.   (4.8) \n\nSince (w*)^T f_θ̂(x) is the Taylor expansion of F^{-1}(P(y = +1|x, θ*)) up to the first order (4.2), the first order terms of Δθ are excluded from the right side of (4.8); thus D(f_θ̂) = O(||Δθ||²). Since both the plug-in estimate and the TOP kernel depend on the parameter estimate θ̂, the errors D_π(θ̂) and D(f_θ̂) become smaller as ||Δθ|| decreases. This shows that if w and b are optimally chosen, the rate of convergence of the TOP kernel is much faster than that of the plug-in estimate. 
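The construction (4.3)-(4.5) can be sketched on a toy model. The code below (an illustration, not the paper's HMM setting) assumes two unit-variance Gaussian class-conditionals with means m₊, m₋ and class prior a, so that the posterior log-odds v(x, θ) is available in closed form; the derivatives ∂_{θ_i} v are approximated by central finite differences. All names are illustrative.

```python
import numpy as np

def log_odds(x, theta):
    """v(x, theta) = log P(y=+1|x) - log P(y=-1|x) for the toy model."""
    a, m_pos, m_neg = theta
    lp = np.log(a) - 0.5 * (x - m_pos) ** 2       # log [a N(x; m_pos, 1)] + const
    ln = np.log(1 - a) - 0.5 * (x - m_neg) ** 2   # log [(1-a) N(x; m_neg, 1)] + const
    return lp - ln                                # shared constants cancel

def top_features(x, theta, eps=1e-6):
    """TOP feature vector (4.3): (v, dv/dth_1, ..., dv/dth_p)."""
    theta = np.asarray(theta, dtype=float)
    v = log_odds(x, theta)
    grad = np.empty_like(theta)
    for i in range(theta.size):                   # central differences
        d = np.zeros_like(theta); d[i] = eps
        grad[i] = (log_odds(x, theta + d) - log_odds(x, theta - d)) / (2 * eps)
    return np.concatenate([[v], grad])

def top_kernel(x, xp, theta):
    """TOP kernel (4.5): plain inner product of the feature vectors."""
    return top_features(x, theta) @ top_features(xp, theta)

theta_hat = (0.5, 1.0, -1.0)
print(top_features(0.2, theta_hat))
```

Taking only the first feature with w = (1, 0, ..., 0), b = 0 reproduces the plug-in decision rule, which is how the kernel inherits (3.1).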
This result is closely related to large sample performance: assuming that θ̂ is an n^{1/2}-consistent estimator with asymptotic normality (e.g. the maximum likelihood estimator), we have ||Δθ|| = O_p(n^{-1/2}) [7], where O_p denotes stochastic order, cf. [1]. So we can directly derive the convergence orders D_π(θ̂) = O_p(n^{-1/2}) and D(f_θ̂) = O_p(n^{-1}). By using the relation (2.4), it follows that R_π(θ̂) - L* = O_p(n^{-1/2}) and R(f_θ̂) - L* = O_p(n^{-1}).⁵ Therefore, the TOP kernel has a much better convergence rate in R(f_θ̂), which is a strong motivation to use the TOP kernel instead of the plug-in estimate. \n\n⁵ For a detailed discussion of the convergence orders of the classification error, see Chapter 6 of [1]. \n\nHowever, we must notice that this fast rate is possible only when the optimal linear classifier is combined with the TOP kernel. Since non-optimal linear classifiers typically have the rate O_p(n^{-1/2}) [1], the overall rate is dominated by the slower rate and turns out to be O_p(n^{-1/2}). But this theoretical analysis is still meaningful, because it shows the existence of a very efficient linear boundary in the TOP feature space. This result encourages practical efforts to improve linear boundaries by engineering loss functions and regularization terms with e.g. cross validation, bootstrapping or other model selection criteria [1]. \n\nExponential Family: A Special Case. When the distributions of the two classes belong to the exponential family, the TOP kernel can achieve an even better result than shown above. Distributions of the exponential family can be written as q(x, η) = exp(η^T t(x) + ζ(η)), where t(x) is a vector-valued function called the sufficient statistics and ζ(η) is a normalization factor [4]. Let α denote the parameter for the class prior probability of the positive class, P(y = +1). 
Then, the probabilistic model is described as \n\np(x, y = +1|θ) = α q(x, η_{+1}),   p(x, y = -1|θ) = (1 - α) q(x, η_{-1}), \n\nwhere θ = {α, η_{+1}, η_{-1}}. The posterior log-odds reads \n\nv(x, θ) = log α - log(1 - α) + η_{+1}^T t_{+1}(x) + ζ(η_{+1}) - η_{-1}^T t_{-1}(x) - ζ(η_{-1}). \n\nThe TOP feature vector is described as \n\nf_θ̂(x) = (v(x, θ̂), ∂_α v(x, θ̂), ∇_{η_{+1}} v(x, θ̂)^T, ∇_{η_{-1}} v(x, θ̂)^T)^T, \n\nwhere ∇_{η_s} v(x, θ̂) = s(t_s(x) + ∇_{η_s} ζ(η̂_s)) for s ∈ {+1, -1}. So, when w = (1, 0, (η*_{+1} - η̂_{+1})^T, (η*_{-1} - η̂_{-1})^T)^T and b is properly set, the true log-odds F^{-1}(P(y = +1|x, θ*)) can be constructed as a linear function in the feature space (4.1). Thus D(f_θ̂) = 0 and R(f_θ̂) = L*. Furthermore, since each feature is represented as a linear function of the sufficient statistics t_{+1}(x) and t_{-1}(x), one can construct an equivalent feature space as (t_{+1}(x)^T, t_{-1}(x)^T)^T without knowing θ̂. This result is important because all graphical models without hidden states can be represented as members of the exponential family, for example Markov models [4]. \n\n5 Experiments on Protein Data \n\nIn order to illustrate that the TOP kernel works well for real-world problems, we show results on protein classification. The protein sequence data is obtained from the Superfamily website.⁶ This site provides sequence files with different degrees of redundancy filtering; we used the one with 10% redundancy filtering. Here, 4541 sequences are hierarchically labeled into 7 classes, 558 folds, 845 superfamilies and 1343 families according to the SCOP (1.53) scheme. In our experiment, we take the top category, \"classes\", as the learning target. The numbers of sequences in the classes are 791, 1277, 1015, 915, 84, 76 and 383. We only use the first 4 classes, and 6 two-class problems are generated from all pairs among these 4 classes. The 5th and 6th classes are not used because the number of examples is too small. Also, the 7th class is not used because this class is quite different from the others and too easy to classify. 
In each two-class problem, the examples are randomly divided into a 25% training set, a 25% validation set and a 50% test set. The validation set is used for model selection. \n\n⁶ http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY/ \n\nAs a probabilistic model for protein sequences, we make use of hidden Markov models [2] with fully connected states.⁷ The Baum-Welch algorithm (e.g. [2]) is used for maximum likelihood training. To construct the FK and TOP kernels, the derivatives with respect to all parameters of the HMMs from both classes are included. The derivative with respect to the class prior probability is included as well: let q(x, θ_s) be the probability density function of the HMM for class s. Then, the marginal distribution is written as p(x|θ) = α q(x, θ_{+1}) + (1 - α) q(x, θ_{-1}), where α is a parameter corresponding to the class prior. The feature vector of the FK consists of the following: \n\n∇_{θ_s} log p(x|θ̂) = P(y = s|x, θ̂) ∇_{θ_s} log q(x, θ̂_s),   s ∈ {-1, +1},   (5.1) \n\n∂_α log p(x|θ̂) = (1/α̂) P(y = +1|x, θ̂) - (1/(1 - α̂)) P(y = -1|x, θ̂),   (5.2) \n\nwhile the feature vector of TOP includes ∇_{θ_s} v(x, θ̂) = s ∇_{θ_s} log q(x, θ̂_s) for s ∈ {+1, -1}.⁸ We obtained θ̂_{+1} and θ̂_{-1} from the training examples of the respective classes and set α = 0.5. In the definition of the TOP kernel (4.5), we did not include any normalization of the feature vectors. However, in practical situations, it is effective to normalize features to improve classification performance. Here, each feature of the TOP kernel is normalized to have mean 0 and variance 1. Also for the FK, we normalized the features in the same way instead of using the Fisher information matrix, because it is difficult to estimate it reliably in a high dimensional parameter space. Both the TOP kernel and the FK are combined with SVMs with bias terms. 
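The two feature maps just described can be sketched generically. The code below assumes the per-class log-densities log q(x, θ̂_s) and score vectors ∇_{θ_s} log q(x, θ̂_s) have already been computed (here they are plain NumPy arrays standing in for the HMM quantities; `fk_features`, `top_features_hmm` and `standardize` are illustrative names, not an existing API). It assembles the FK vector per (5.1)-(5.2) and the TOP vector from the log-odds gradient, then standardizes each feature to mean 0 and variance 1 as in the experiments.

```python
import numpy as np

def posterior_pos(lq_pos, lq_neg, alpha=0.5):
    """P(y=+1|x) from class-conditional log-densities log q(x, th_s)."""
    w_pos = alpha * np.exp(lq_pos)
    w_neg = (1.0 - alpha) * np.exp(lq_neg)
    return w_pos / (w_pos + w_neg)

def fk_features(lq_pos, lq_neg, score_pos, score_neg, alpha=0.5):
    """Fisher-kernel feature vector per (5.1)-(5.2)."""
    p_pos = posterior_pos(lq_pos, lq_neg, alpha)
    p_neg = 1.0 - p_pos
    d_alpha = p_pos / alpha - p_neg / (1.0 - alpha)       # (5.2)
    return np.concatenate([[d_alpha],
                           p_pos * score_pos,             # (5.1), s = +1
                           p_neg * score_neg])            # (5.1), s = -1

def top_features_hmm(lq_pos, lq_neg, score_pos, score_neg, alpha=0.5):
    """TOP features: log-odds v and its gradient s * grad log q."""
    v = np.log(alpha) + lq_pos - np.log(1.0 - alpha) - lq_neg
    return np.concatenate([[v], score_pos, -score_neg])

def standardize(F, eps=1e-12):
    """Normalize each feature (column of F) to mean 0, variance 1."""
    return (F - F.mean(axis=0)) / (F.std(axis=0) + eps)
```

Stacking the per-example feature vectors into a matrix F and applying `standardize` plays the role that Z^{-1} plays in the original FK definition, as described above.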
\n\nWhen classifying with HMMs , one observes the difference of the log-likelihoods for \nthe two classes and discriminat es by thresholding at an appropriate value. Theo(cid:173)\nretically, this threshold should be determined by the (true) class prior probability. \nBut, this is typically not available. Furthermore the estimation of the prior prob(cid:173)\nability from training data often leads to poor results [2] . To avoid this problem, \nthe threshold is determined such that the false positive rate and the false negative \nrate are equal in the test set. This threshold is determined in the same way for \nFK-SVMs and TOP-SVMs. \n\nThe hybrid HMM-TOP-SVM system has several model parameters: the number \nof HMM states, the pseudo count value [2] and the regularization parameter C of \nthe SVM. vVe determine these parameters as follows: First, the number of states \nand the pseudo count value are determined such that the error of the HMM on \nthe validation set (i.e. validation error) is minimized. Based on the chosen HMM \nmodel, the paramet er C is det ermined such that the validation error of TOP-SVM is \nminimized. Here, the number of states and the pseudo count value are chosen from \n{3, 5,7,10,15,20,30,40, 60} and {l0-10, 10- 7 , 10 - 5 , 10- 4 ,10 - 3 , 1O- 2 }, respectively. \nFor C, 15 equally spaced points on the log scale are taken from [10-4 ,101]. Note \nthat the model selection is performed in the same manner for the Fisher kernel as \nwell. \n\nThe error rates over 15 different training/validation/test divisions are shown in Fig(cid:173)\nure 1 and 2. The results of statistical tests are shown in Table 1 as well. Compared \nwith the plug-in estimate, the Fisher kernel performed significantly better in sev(cid:173)\neral settings (i.e. 1-3, 2-3, 3-4). This result partially agrees with observations in \n[6]. 
However, our TOP approach significantly outperforms the Fisher kernel: according to the Wilcoxon signed ranks test, the TOP kernel was significantly better in all settings. Also, the t-test judged that the difference is significant except for 1-4 and 2-4. This indicates that the TOP kernel was able to capture discriminative information better than the Fisher kernel. \n\n⁷ Several HMM models have been engineered for protein classification [2]. However, we do not use such HMMs because the main purpose of the experiment is to compare FK and TOP. \n\n⁸ ∂_α v(x, θ̂) is a constant which does not depend on x, so it is not included in the feature vector. \n\n[Figure 1: The error rates of SVMs with two feature extractors in the protein classification experiments. The labels 'P', 'FK' and 'TOP' denote the plug-in estimate, the Fisher kernel and the TOP kernel, respectively. The title of each subfigure shows the two protein classes used for the experiment.] \n\n[Figure 2: Comparison of the error rates of the Fisher kernel and the TOP kernel in discrimination between classes 1 and 2. Every point corresponds to one of 15 different training/validation/test set splits. Except in two cases, the TOP kernel achieves smaller error rates.] 
6 Conclusion \n\nIn this study, we presented the new discriminative TOP kernel derived from probabilistic models. Since a theoretical framework for such kernels has so far not been established, we proposed two performance measures to analyze them, and gave bounds and rates to gain better insight into model-dependent feature extractors from probabilistic models. Experimentally, we showed that the TOP kernel compares favorably to the FK in a realistic protein classification experiment. Note that Smith and Gales [8] have shown that a similar approach works excellently in speech recognition tasks as well. Future research will focus on constructing small sample bounds for the TOP kernel to extend the validity of this work. Since other nonlinear transformations F are possible for obtaining different and possibly better features, we will furthermore consider learning the nonlinear transformation F from training samples. An interesting point is that so far TOP kernels perform local linear approximations; it would be interesting to move in the direction of local or even global nonlinear expansions. \n\nTable 1: P-values of statistical tests in the protein classification experiments. Two kinds of tests, the t-test (denoted as T in the table) and the Wilcoxon signed ranks test (WX), are used. When the difference is significant (p-value < 0.05), a single star * is put beside the value. Double stars ** indicate that the difference is very significant (p-value < 0.01). 
| Methods | Test | 1-2 | 1-3 | 1-4 | \n| P, FK | T | 0.95 | 0.14 | 0.78 | \n| P, FK | WX | 0.85 | 0.041* | 0.24 | \n| P, TOP | T | 0.015* | 1.7 × 10^-5** | 0.11 | \n| P, TOP | WX | 4.3 × 10^-4** | 6.1 × 10^-5** | 0.030* | \n| FK, TOP | T | 0.0093** | 2.2 × 10^-4** | 0.21 | \n| FK, TOP | WX | 8.5 × 10^-4** | 6.1 × 10^-5** | 0.048* | \n\n| Methods | Test | 2-3 | 2-4 | 3-4 | \n| P, FK | T | 0.0032** | 0.79 | 0.12 | \n| P, FK | WX | 0.0040** | 0.80 | 0.026* | \n| P, TOP | T | 3.0 × 10^-10** | 0.059 | 5.3 × 10^-5** | \n| P, TOP | WX | 6.1 × 10^-5** | 0.035* | 3.1 × 10^-4** | \n| FK, TOP | T | 2.6 × 10^-4** | 0.079 | 0.0031** | \n| FK, TOP | WX | 6.1 × 10^-5** | 0.0034** | 1.8 × 10^-4** | \n\nAcknowledgments. We thank T. Tanaka, M. Sugiyama, S.-I. Amari, K. Karplus, R. Karchin, F. Sohler and A. Zien for valuable discussions. Moreover, we gratefully acknowledge partial support from DFG (JA 379/9-1, MU 987/1-1) and travel grants from the EU (Neurocolt II). \n\nReferences \n\n[1] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996. \n\n[2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. \n\n[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, 2nd edition, 1990. \n\n[4] D. Geiger and C. Meek. Graphical models and exponential families. Technical Report MSR-TR-98-10, Microsoft Research, 1998. \n\n[5] T.S. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting remote protein homologies. J. Comp. Biol., 7:95-114, 2000. \n\n[6] T.S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 487-493. MIT Press, 1999. \n\n[7] N. Murata, S. Yoshizawa, and S. Amari. 
Network information criterion: determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5:865-872, 1994. \n\n[8] N. Smith and M. Gales. Speech recognition using SVMs. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, 2002. To appear. \n", "award": [], "sourceid": 2014, "authors": [{"given_name": "Koji", "family_name": "Tsuda", "institution": null}, {"given_name": "Motoaki", "family_name": "Kawanabe", "institution": null}, {"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "S\u00f6ren", "family_name": "Sonnenburg", "institution": null}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": null}]}