{"title": "Exploiting Generative Models in Discriminative Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 487, "page_last": 493, "abstract": null, "full_text": "Exploiting generative models \n\n\u2022 In \n\ndiscriminative classifiers \n\nTommi S. Jaakkola* \n\nDavid Haussler \n\nMIT Artificial Intelligence Laboratorio \n\nDepartment of Computer Science \n\n545 Technology Square \nCambridge, MA 02139 \n\nUniversity of California \nSanta Cruz, CA 95064 \n\nAbstract \n\nGenerative probability models such as hidden ~larkov models pro(cid:173)\nvide a principled way of treating missing information and dealing \nwith variable length sequences. On the other hand , discriminative \nmethods such as support vector machines enable us to construct \nflexible decision boundaries and often result in classification per(cid:173)\nformance superior to that of the model based approaches. An ideal \nclassifier should combine these two complementary approaches. In \nthis paper, we develop a natural way of achieving this combina(cid:173)\ntion by deriving kernel functions for use in discriminative methods \nsuch as support vector machines from generative probability mod(cid:173)\nels. We provide a theoretical justification for this combination as \nwell as demonstrate a substantial improvement in the classification \nperformance in the context of D~A and protein sequence analysis. \n\n1 \n\nIntroduction \n\nSpeech, vision , text and biosequence data can be difficult to deal with in the context \nof simple statistical classification problems. Because the examples to be classified \nare often sequences or arrays of variable size that may have been distorted in par(cid:173)\nticular ways, it is common to estimate a generative model for such data, and then \nuse Bayes rule to obtain a classifier from this model. However. 
many discriminative methods, which directly estimate a posterior probability for a class label (as in Gaussian process classifiers [5]) or a discriminant function for the class label (as in support vector machines [6]), have in other areas proven to be superior to generative models for classification problems. The problem is that there has been no systematic way to extract features or metric relations between examples for use with discriminative methods in the context of difficult data types such as those listed above. Here we propose a general method for extracting these discriminatory features using a generative model. While the features we propose are generally applicable, they are most naturally suited to kernel methods. \n\n* Corresponding author. \n\n2 Kernel methods \n\nHere we provide a brief introduction to kernel methods; see, e.g., [6] [5] for more details. Suppose now that we have a training set of examples X_i and corresponding binary labels S_i (\u00b11). In kernel methods, as we define them, the label for a new example X is obtained from a weighted sum of the training labels. The weighting of each training label S_i consists of two parts: 1) the overall importance of the example X_i, as summarized with a coefficient \u03bb_i, and 2) a measure of pairwise \"similarity\" between X_i and X, expressed in terms of a kernel function K(X_i, X). The predicted label S for the new example X is derived from the following rule: \n\nS = sign( \u03a3_i S_i \u03bb_i K(X_i, X) ) \n\n(1) \n\nWe note that this class of kernel methods also includes probabilistic classifiers, in which case the above rule refers to the label with the maximum probability. The free parameters in the classification rule are the coefficients \u03bb_i and to some degree also the kernel function K. To pin down a particular kernel method, two things need to be clarified. First, we must define a classification loss. 
Equivalently, we must specify the optimization problem that is solved to determine appropriate values for the coefficients \u03bb_i. Slight variations in the optimization problem can take us from support vector machines to generalized linear models. The second and more important issue is the choice of the kernel function, the main topic of this paper. We begin with a brief illustration of generalized linear models as kernel methods. \n\n2.1 Generalized linear models \n\nFor concreteness we consider here only logistic regression models, while emphasizing that the ideas are applicable to a larger class of models\u00b9. In logistic regression models, the probability of the label S given the example X and a parameter vector \u03b8 is given by\u00b2 \n\nP(S|X, \u03b8) = \u03c3( S \u03b8^T X ) \n\n(2) \n\nwhere \u03c3(z) = (1 + e^{-z})^{-1} is the logistic function. \n\n\u00b9 Specifically, it applies to all generalized linear models whose transfer functions are log-concave. \n\u00b2 Here we assume that the constant +1 is appended to every feature vector X so that an adjustable bias term is included in the inner product \u03b8^T X. \n\nTo control the complexity of the model when the number of training examples is small, we can assign a prior distribution P(\u03b8) over the parameters. We assume here that the prior is a zero-mean Gaussian with a possibly full covariance matrix \u03a3. The maximum a posteriori (MAP) estimate for the parameters \u03b8 given a training set of examples is found by maximizing the following penalized log-likelihood: \n\n\u03a3_i log P(S_i|X_i, \u03b8) + log P(\u03b8) = \u03a3_i log P(S_i|X_i, \u03b8) - (1/2) \u03b8^T \u03a3^{-1} \u03b8 + c \n\n(3) \n\nwhere the constant c does not depend on \u03b8. It is straightforward to show, simply by taking the gradient with respect to the parameters, that the solution to this (concave) maximization problem can be written as\u00b3 \n\n\u03b8 = \u03a3_i \u03bb_i S_i \u03a3 X_i \n\n(4) \n\nNote that the coefficients \u03bb_i appear as weights on the training examples, as in the definition of the kernel methods. 
Inserting the above solution back into the conditional probability model indeed gives \n\nP(S|X, \u03b8) = \u03c3( S \u03a3_i S_i \u03bb_i X_i^T \u03a3 X ) \n\n(5) \n\nBy identifying K(X_i, X) = X_i^T \u03a3 X and noting that the label with the maximum probability is the one that has the same sign as the sum in the argument, this gives the decision rule (1). \n\nThrough the above derivation, we have written the primal parameters \u03b8 in terms of the dual coefficients \u03bb_i\u2074. Consequently, the penalized log-likelihood function can also be written entirely in terms of \u03bb_i; the resulting likelihood function specifies how the coefficients are to be optimized. This optimization problem has a unique solution and can be put into a generic form. Also, the form of the kernel function that establishes the connection between the logistic regression model and a kernel classifier is rather specific, i.e., it has the inner product form K(X_i, X) = X_i^T \u03a3 X. However, as long as the examples here can be replaced with feature vectors derived from the examples, this form of the kernel function is the most general. We discuss this further in the next section. \n\n3 The kernel function \n\nFor a general kernel function to be valid, roughly speaking it only needs to be positive semi-definite (see e.g. [7]). According to Mercer's theorem, any such valid kernel function admits a representation as a simple inner product between suitably defined feature vectors, i.e., K(X_i, X_j) = \u03c6_{X_i}^T \u03c6_{X_j}, where the feature vectors come from some fixed mapping X \u2192 \u03c6_X. For example, in the previous section the kernel function had the form X_i^T \u03a3 X_j, which is a simple inner product for the transformed feature vector \u03c6_X = \u03a3^{1/2} X. \n\nSpecifying a simple inner product in the feature space defines a Euclidean metric space. 
The Euclidean distances between the feature vectors then follow directly from the kernel function: with the shorthand notation K_ij = K(X_i, X_j) we get \n\n|| \u03c6_{X_i} - \u03c6_{X_j} ||^2 = K_ii - 2 K_ij + K_jj \n\n\u00b3 This corresponds to a Legendre transformation of the loss function log \u03c3(z). \n\u2074 This is possible for all those \u03b8 that could arise as solutions to the maximum penalized likelihood problem; in other words, for all relevant \u03b8. \n", "award": [], "sourceid": 1520, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "David", "family_name": "Haussler", "institution": null}]}