{"title": "Maximum Entropy Discrimination", "book": "Advances in Neural Information Processing Systems", "page_first": 470, "page_last": 476, "abstract": null, "full_text": "Maximum entropy discrimination \n\nTommi Jaakkola \n\nMIT AI Lab \n\n545 Technology Sq. \n\nMarina Meila \n\nMIT AI Lab \n\n545 Technology Sq. \n\nTony Jebara \nMIT Media Lab \n\n20 Ames St. \n\nCambridge, MA 02139 \n\nCambridge, MA 02139 \n\ntommi@ai.mit.edu \n\nmmp@ai. mit. edu \n\nCambridge, MA 02139 \njebara@media. mit. edu \n\nAbstract \n\nWe present a general framework for discriminative estimation based \non the maximum entropy principle and its extensions. All calcula(cid:173)\ntions involve distributions over structures and/or parameters rather \nthan specific settings and reduce to relative entropy projections. \nThis holds even when the data is not separable within the chosen \nparametric class, in the context of anomaly detection rather than \nclassification, or when the labels in the training set are uncertain or \nincomplete. Support vector machines are naturally subsumed un(cid:173)\nder this class and we provide several extensions. We are also able \nto estimate exactly and efficiently discriminative distributions over \ntree structures of class-conditional models within this framework. \nPreliminary experimental results are indicative of the potential in \nthese techniques. \n\n1 \n\nIntroduction \n\nEffective discrimination is essential in many application areas. Employing gener(cid:173)\native probability models such as mixture models in this context is attractive but \nthe criterion (e.g., maximum likelihood) used for parameter/structure estimation \nis suboptimal. Support vector machines (SVMs) are, for example, more robust \ntechniques as they are specifically designed for discrimination [9]. \n\nOur approach towards general discriminative training is based on the well known \nmaximum entropy principle (e.g., [3]). 
This enables an appropriate training of both ordinary and structural parameters of the model (cf. [5, 7]). The approach is not limited to probability models and extends, e.g., SVMs. \n\n2 Maximum entropy classification \n\nConsider a two-class classification problem(1) where labels y ∈ {-1, 1} are assigned to examples X ∈ X. Given two generative probability distributions P(X|θ_y) with parameters θ_y, one for each class, the corresponding decision rule follows the sign of the discriminant function: \n\nL(X|Θ) = log [ P(X|θ_1) / P(X|θ_{-1}) ] + b   (1) \n\nwhere Θ = {θ_1, θ_{-1}, b} and b is a bias term, usually expressed as a log-ratio b = log p/(1-p). The class-conditional distributions may come from different families of distributions, or the parametric discriminant function could be specified directly without any reference to models. The parameters θ_y may also include the model structure (see later sections). \n\n(1) The extension to multi-class problems is straightforward [4]. The formulation also admits an easy extension to regression problems, analogously to SVMs. \n\nThe parameters Θ = {θ_1, θ_{-1}, b} should be chosen to maximize classification accuracy. We consider here the more general problem of finding a distribution P(Θ) over parameters and using a convex combination of discriminant functions, i.e., ∫ P(Θ) L(X|Θ) dΘ, in the decision rule. The search for the optimal P(Θ) can be formalized as a maximum entropy (ME) estimation problem. Given a set of training examples {X_1, ..., X_T} and corresponding labels {y_1, ..., y_T}, we find a distribution P(Θ) that maximizes the entropy H(P) subject to the classification constraints ∫ P(Θ) [ y_t L(X_t|Θ) ] dΘ ≥ γ for all t. Here γ > 0 specifies a desired classification margin. The solution is unique (if it exists) since H(P) is concave and the linear constraints specify a convex region.
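As a toy numerical illustration of this ME estimation problem (our own sketch, not from the paper: the 1-D threshold grid, the tanh discriminant, and the margin value below are all assumptions), the entropy can be maximized over a discrete grid of candidate parameters with an off-the-shelf constrained optimizer:

```python
# Hypothetical example: Theta is a grid of candidate thresholds and the
# discriminant is L(x|theta) = tanh(x - theta). We maximize the entropy of
# P(Theta) subject to the margin constraints  E_P[ y_t L(X_t|Theta) ] >= gamma.
import numpy as np
from scipy.optimize import minimize

thetas = np.linspace(-3.0, 3.0, 25)            # discrete parameter grid
X = np.array([-2.0, -1.5, 1.0, 2.5])           # training examples
y = np.array([-1.0, -1.0, 1.0, 1.0])           # labels
gamma = 0.5                                    # desired margin
L = np.tanh(X[:, None] - thetas[None, :])      # L[t, j] = L(X_t | theta_j)

def neg_entropy(p):
    q = np.clip(p, 1e-12, None)                # guard log(0)
    return float(np.sum(q * np.log(q)))        # -H(P)

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
cons += [{"type": "ineq", "fun": lambda p, t=t: y[t] * (L[t] @ p) - gamma}
         for t in range(len(X))]

res = minimize(neg_entropy, np.full(len(thetas), 1.0 / len(thetas)),
               method="SLSQP", bounds=[(0.0, 1.0)] * len(thetas),
               constraints=cons)
p = res.x  # maximum entropy distribution satisfying all margin constraints
```

The uniform distribution violates the margin constraint for the example at x = 1.0, so the constrained solution is necessarily non-uniform, illustrating that the entropy preference operates only inside the admissible set.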
Note that the preference towards high entropy distributions (fewer assumptions) applies only within the admissible set of distributions P consistent with the constraints. See [2] for related work. \n\nWe will extend this basic idea in a number of ways. The ME formulation assumes, for example, that the training examples can be separated with the specified margin. We may also have a reason to prefer some parameter values over others and would therefore like to incorporate a prior distribution P_0(Θ). Other extensions and generalizations will be discussed later in the paper. \n\nA more complete formulation is based on the following minimum relative entropy principle: \n\nDefinition 1. Let {X_t, y_t} be the training examples and labels, L(X|Θ) a parametric discriminant function, and γ = [γ_1, ..., γ_T] a set of margin variables. Assuming a prior distribution P_0(Θ, γ), we find the discriminative minimum relative entropy (MRE) distribution P(Θ, γ) by minimizing D(P || P_0) subject to \n\n∫ P(Θ, γ) [ y_t L(X_t|Θ) - γ_t ] dΘ dγ ≥ 0   (2) \n\nfor all t. Here ŷ = sign( ∫ P(Θ) L(X|Θ) dΘ ) specifies the decision rule for any new example X. \n\nThe margin constraints and the preference towards large margin solutions are encoded in the prior P_0(γ). Allowing negative margin values with non-zero probabilities also guarantees that the admissible set P, consisting of distributions P(Θ, γ) consistent with the constraints, is never empty. Even when the examples cannot be separated by any discriminant function in the parametric class (e.g., linear), we get a valid solution. The misclassification penalties follow from P_0(γ) as well. \n\nFigure 1: a) Minimum relative entropy (MRE) projection from the prior distribution to the admissible set. b) The margin prior P_0(γ_t).
c) The potential terms in the MRE formulation (solid line) and in SVMs (dashed line); c = 5 in this case. \n\nSuppose P_0(Θ, γ) = P_0(Θ) P_0(γ) and P_0(γ) = Π_t P_0(γ_t), where \n\nP_0(γ_t) = c e^{-c(1-γ_t)} for γ_t ≤ 1.   (3) \n\nThis is shown in Figure 1b. The penalty for margins smaller than 1 - 1/c (the prior mean of γ_t) is given by the relative entropy distance between P(γ) and P_0(γ). This is similar but not identical to the use of slack variables in support vector machines. Other choices of the prior are discussed in [4]. \n\nThe MRE solution can be viewed as a relative entropy projection from the prior distribution P_0(Θ, γ) to the admissible set P. Figure 1a illustrates this view. From the point of view of regularization theory, the prior probability P_0 specifies the entropic regularization used in this approach. \n\nTheorem 1. The solution to the MRE problem has the following general form [1]: \n\nP(Θ, γ) = (1/Z(λ)) P_0(Θ, γ) exp{ Σ_t λ_t [ y_t L(X_t|Θ) - γ_t ] }   (4) \n\nwhere Z(λ) is the normalization constant (partition function) and λ = {λ_1, ..., λ_T} defines a set of non-negative Lagrange multipliers, one for each classification constraint. The multipliers λ are set by finding the unique maximum of the following jointly concave objective function: J(λ) = -log Z(λ). \n\nThe solution is sparse, i.e., only a few Lagrange multipliers will be non-zero. This arises because many of the classification constraints become irrelevant once the constraints are enforced for a small subset of examples. Sparsity leads to immediate but weak generalization guarantees expressed in terms of the number of non-zero Lagrange multipliers [4]. Practical leave-one-out cross-validation estimates can also be derived. \n\n2.1 Practical realization of the MRE solution \n\nWe now turn to finding the MRE solution.
To begin with, we note that any disjoint factorization of the prior P_0(Θ, γ), where the corresponding parameters appear in distinct additive components of y_t L(X_t|Θ) - γ_t, leads to a disjoint factorization of the MRE solution P(Θ, γ). For example, {Θ \ b, b, γ} provides such a factorization. As a result of this factorization, the bias term could be eliminated by imposing additional constraints on the Lagrange multipliers [4]. This is analogous to the handling of the bias term in support vector machines [9]. \n\nWe consider now a few specific realizations such as support vector machines and a class of graphical models. \n\n2.1.1 Support vector machines \n\nIt is well known that the log-likelihood ratio of two Gaussian distributions with equal covariance matrices yields a linear decision rule. With a few additional assumptions, the MRE formulation gives support vector machines: \n\nTheorem 2. Assuming L(X, Θ) = θᵀX - b and P_0(Θ, γ) = P_0(θ) P_0(b) P_0(γ), where P_0(θ) is N(0, I), P_0(b) approaches a non-informative prior, and P_0(γ) is given by eq. (3), then the Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λ_t ≤ c and Σ_t λ_t y_t = 0, where \n\nJ(λ) = Σ_t [ λ_t + log(1 - λ_t/c) ] - (1/2) Σ_{t,t'} λ_t λ_t' y_t y_t' (X_tᵀ X_t')   (5) \n\nThe only difference between our J(λ) and the (dual) optimization problem for SVMs is the additional potential term log(1 - λ_t/c). This highlights the effect of the different misclassification penalties, which in our case come from the MRE projection. Figure 1c shows, however, that the additional potential term does not always carry a huge effect (for c = 5). Moreover, in the separable case, letting c → ∞, the two methods coincide. The decision rules are formally identical.
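Because J(λ) is the standard SVM dual plus a smooth log-barrier term, it can be maximized with a generic constrained optimizer. A minimal sketch follows (our own illustration, not the paper's code; the toy data, the optimizer choice, and the midpoint bias rule are assumptions, since the paper instead integrates the bias out under a non-informative prior):

```python
# Illustrative only: maximize J(lambda) from eq. (5) for a linear discriminant
# on a separable toy problem, then classify with the induced mean weight vector.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)),
               rng.normal(-2.0, 0.5, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
c = 5.0
K = X @ X.T                                   # linear kernel Gram matrix

def neg_J(lam):
    lam = np.clip(lam, 0.0, c - 1e-9)         # keep the log barrier finite
    return -(np.sum(lam + np.log(1.0 - lam / c))
             - 0.5 * (lam * y) @ K @ (lam * y))

res = minimize(neg_J, np.full(40, 0.1), method="SLSQP",
               bounds=[(0.0, c - 1e-6)] * 40,
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
lam = res.x
w = (lam * y) @ X                             # mean weight vector under the MRE solution
# a simple midpoint bias (an assumed heuristic, adequate for separable data)
b = -0.5 * ((X @ w)[y > 0].min() + (X @ w)[y < 0].max())
accuracy = (np.sign(X @ w + b) == y).mean()
```

Note how the barrier term makes every λ_t strictly positive, unlike the hard-sparse SVM dual; as c grows the barrier vanishes and the two objectives coincide, matching the discussion above.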
\n\nWe now consider the case where the discriminant function C(X, e) corresponds to \nthe log-likelihood ratio of two Gaussians with different (and adjustable) covariance \nmatrices. The parameters e in this case are both the means and the covariances. \nThe prior paCe) must be the conjugate Normal-Wishart to obtain closed form \nintegrals2 for the partition function, Z. Here, p(e l , e- l ) is P(m1' VdP(m-1, V-d, \na density over means and covariances. \nThe prior distribution has the form po(ed = N(m1; mo, Vdk) IW(V1; kVo, k) with \nparameters (k, mo, Vo) that can be specified manually or one may let k -+ 0 to get \na non-informative prior. Integrating over the parameters and the margin, we get \nZ = Z\"( X Zl X Z-l, where \n\n(6) \n\nT \n\n-\n\n-T \n\n. \n\n6 \n\nw \n\n.:l \n\n-.:l \n\nN1 = 2:t Wt, Xl = 2:t ~Xt, 3 1 = 2:t WtXtXt - N 1X I X 1 . Here, Wt IS a scalar \nweight given by Wt = u(Yt)+YtAt. For Z-l, the weights are set to Wt = u( -Yt)-YtAt; \nu(\u00b7) is the step function. Given Z, updating A is done by maximizing J(A). The \nresulting marginal MRE distribution over the parameters (normalized by Zl x Z-d \nis a Normal-Wishart distribution itself, p(e 1 ) = N(m1; Xl, VdNd IW(V1; 3 1 , N 1) \nwith the final A values. Predicting the label for a new example X involves taking \nexpectations of the discriminant function under a Normal-Wishart. This is \n\nWe thus obtain discriminative quadratic decision boundaries. These extend the \nlinear boundaries without (explicitly) resorting to kernels. More generally, the \ncovariance estimation in this framework adaptively modifies the kernel. \n\n2This can be done more generally for conjugate priors in the exponential family. \n\n\f474 \n\nT. Jaakkola, M Meila and T. Jebara \n\n2.1.2 Graphical models \n\nWe consider here graphical models with no hidden variables. The ME (or MRE) \ndistribution is in this case a distribution over both structures and parameters. 
Finding the distribution over parameters can be done in closed form for conjugate priors when the observations are complete. The distribution over structures is, in general, intractable. A notable exception is the tree model that we discuss in what follows. \n\nA tree graphical model is a graphical model whose structure is a tree. This model has the property that its log-likelihood can be expressed as a sum of local terms [8]: \n\nlog P(X, E|θ) = Σ_u h_u(X, θ) + Σ_{uv∈E} w_uv(X, θ)   (8) \n\nThe discriminant function consisting of the log-likelihood ratio of a pair of tree models (depending on the edge sets E_1, E_{-1} and parameters θ_1, θ_{-1}) can also be expressed in this form. \n\nWe consider here the ME distribution over tree structures for fixed parameters(3). The treatment of the general case (i.e., including the parameters) is a direct extension of this result. The ME distribution over the edge sets E_1 and E_{-1} factorizes with components \n\nP(E_{±1}) = (1/Z_{±1}) exp{ ±Σ_t λ_t y_t [ Σ_{uv∈E_{±1}} w_uv(X_t, θ_{±1}) + Σ_u h_u(X_t, θ_{±1}) ] } = (h_{±1}/Z_{±1}) Π_{uv∈E_{±1}} W_uv^{±1}   (9) \n\nwhere Z_{±1}, h_{±1}, and W^{±1} are functions of the same Lagrange multipliers λ. To completely define the distribution we need to find the λ that optimize J(λ) in Theorem 1; for classification we also need to compute averages with respect to P(E_{±1}). For both, it suffices to obtain an expression for the partition function(s) Z_{±1}. \n\nP is a discrete distribution over all possible tree structures for n variables (there are n^{n-2} trees). However, a remarkable graph theory result, called the Matrix Tree Theorem [10], enables us to perform all necessary summations in closed form in polynomial time.
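As a concrete numerical check of this summation (a sketch of ours, not from the paper; the random symmetric edge weights are an assumption), the sum of edge-weight products over all spanning trees can be compared against a brute-force enumeration:

```python
# Illustration: the Matrix Tree Theorem reduces the sum of prod_{uv in E} W_uv
# over all n^{n-2} spanning trees to a single (n-1)x(n-1) determinant.
import numpy as np
from itertools import combinations

n = 4
rng = np.random.default_rng(1)
W = rng.uniform(0.5, 2.0, (n, n))
W = (W + W.T) / 2.0                    # symmetric edge weights W_uv
np.fill_diagonal(W, 0.0)

# weighted Laplacian: off-diagonal -W_uv, diagonal sum_{u'} W_{u'v}
Q = -W.copy()
np.fill_diagonal(Q, W.sum(axis=0))
Z = np.linalg.det(Q[1:, 1:])           # any principal (n-1)x(n-1) minor works

# brute-force check: every acyclic set of n-1 edges on n nodes is a spanning tree
def is_tree(edges, n):
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:                   # adding this edge closes a cycle
            return False
        parent[ru] = rv
    return True

all_edges = list(combinations(range(n), 2))
brute = sum(np.prod([W[u, v] for u, v in T])
            for T in combinations(all_edges, n - 1) if is_tree(T, n))
print(abs(Z - brute))                  # agrees up to floating point error
```

The determinant costs O(n^3) while the enumeration visits n^{n-2} trees, which is exactly the gap the theorem closes.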
On the basis of this result, we find: \n\nTheorem 3. The normalization constant Z of a distribution of the form (9) is \n\nZ = h Σ_E Π_{uv∈E} W_uv = h |Q(W)|,   (10) \n\nwhere \n\nQ_uv(W) = -W_uv for u ≠ v, and Q_vv(W) = Σ_{v'=1}^{n} W_{v'v}.   (11) \n\nThis shows that summing over the distribution of all trees, when this distribution factors according to the trees' edges, can be done in closed form by computing the value of a determinant in time O(n^3). Since we obtain a closed form expression, optimization of the Lagrange multipliers and evaluation of the resulting classification rule are also tractable. \n\nFigure 2a provides a comparison of the discriminative tree approach and a maximum likelihood tree estimation method on a DNA splice junction problem. \n\n(3) Each tree relies on a different set of n-1 pairwise node marginals. In our experiments the class-conditional pairwise marginals were obtained directly from data. \n\nFigure 2: ROC curves based on independent test sets. a) Tree estimation: discriminative (solid) and ML (dashed) trees. b) Anomaly detection: MRE (solid) and Bayes (dashed). c) Partially labeled case: 100% labeled (solid), 10% labeled + 90% unlabeled (dashed), and 10% labeled + 0% unlabeled training examples (dotted). \n\n3 Extensions \n\nAnomaly detection: In anomaly detection we are given a set of training examples representing only one class, the \"typical\" examples. We attempt to capture regularities among the examples to be able to recognize unlikely members of this class. Estimating a probability distribution P(X|θ) on the basis of the training set {X_1, ..., X_T} via the ML (or analogous) criterion is not appropriate; there is no reason to further increase the probability of those examples that are already well captured by the model. A more relevant measure involves the level sets X_γ = {X ∈ X : log P(X|θ) ≥ γ}, which are used in deciding class membership in any case. We estimate the parameters θ to optimize an appropriate level set. \n\nDefinition 2. Given a probability model P(X|θ), θ ∈ Θ, a set of training examples {X_1, ..., X_T}, a set of margin variables γ = [γ_1, ..., γ_T], and a prior distribution P_0(θ, γ), we find the MRE distribution P(θ, γ) that minimizes D(P || P_0) subject to the constraints ∫ P(θ, γ) [ log P(X_t|θ) - γ_t ] dθ dγ ≥ 0 for all t = 1, ..., T. \n\nNote that this is again an MRE projection whose solution can be obtained as before. The choice of P_0(γ) in P_0(θ, γ) = P_0(θ) P_0(γ) is not as straightforward as before, since each margin γ_t needs to be close to achievable log-probabilities. We can nevertheless find a reasonable choice by relating the prior mean of γ_t to some α-percentile of the training set log-probabilities generated through ML or another estimation criterion. Denote the resulting value by l_α and define the prior P_0(γ_t) as P_0(γ_t) = c e^{-c(l_α - γ_t)} for γ_t ≤ l_α. In this case the prior mean of γ_t is l_α - 1/c. \n\nFigure 2b shows, in the context of a simple product distribution, that this choice of prior together with the MRE framework leads to a real improvement over the standard (Bayesian) approach. We believe, however, that the effect will be more striking for sophisticated models such as HMMs that may otherwise easily capture spurious regularities in the data. An extension of this formalism to latent variable models is provided in [4]. \n\nUncertain or incompletely labeled examples: Examples with uncertain labels are hard to deal with in any (probabilistic or not) discriminative classification method.
Uncertain labels can, however, be handled within the maximum entropy formalism: let Y = {Y_1, ..., Y_T} be a set of binary variables corresponding to the labels for the training examples. We can define a prior uncertainty over the labels by specifying P_0(Y); for simplicity, we can take this to be a product distribution P_0(Y) = Π_t P_{t,0}(Y_t), where a different level of uncertainty can be assigned to each example. Consequently, we find the minimum relative entropy projection from the prior distribution P_0(Θ, γ, Y) = P_0(Θ) P_0(γ) P_0(Y) to the admissible set of distributions (no longer a function of the labels) that are consistent with the constraints: Σ_Y ∫_{Θ,γ} P(Θ, γ, Y) [ Y_t L(X_t, Θ) - γ_t ] dΘ dγ ≥ 0 for all t = 1, ..., T. The MRE principle differs from transduction [9]: it provides a soft rather than hard assignment of unlabeled examples, and is fundamentally driven by large margin classification. The exact MRE solution is, however, often not feasible to obtain in practice. We can nevertheless formulate an efficient mean field approach in this context [4]. Figure 2c demonstrates that even the approximate method is able to reap most of the benefit from unlabeled examples (compare, e.g., [6]). The results are for a DNA splice junction classification problem. For more details see [4]. \n\n4 Discussion \n\nWe have presented a general approach to discriminative training of model parameters, structures, or parametric discriminant functions. The formalism is based on the minimum relative entropy principle, reducing all calculations to relative entropy projections. The idea naturally extends beyond standard classification and covers anomaly detection, classification with partially labeled examples, and feature selection. \n\nReferences \n\n[1] Cover, T. and Thomas, J. (1991). Elements of Information Theory. John Wiley & Sons. \n\n[2] Kivinen, J. and Warmuth, M. (1999). Boosting as entropy projection. Proceedings of the 12th Annual Conference on Computational Learning Theory. \n\n[3] Levine, R. and Tribus, M. (eds.) (1978). The Maximum Entropy Formalism. Proceedings of the Maximum Entropy Formalism Conference, MIT. \n\n[4] Jaakkola, T., Meila, M. and Jebara, T. (1999). Maximum entropy discrimination. MIT AITR-1668, http://www.ai.mit.edu/~tommi/papers.html. \n\n[5] Jaakkola, T. and Haussler, D. (1998). Exploiting generative models in discriminative classifiers. NIPS 11. \n\n[6] Joachims, T. (1999). Transductive inference for text classification using support vector machines. International Conference on Machine Learning. \n\n[7] Jebara, T. and Pentland, A. (1998). Maximum conditional likelihood via bound maximization and the CEM algorithm. NIPS 11. \n\n[8] Meila, M. and Jordan, M. (1998). Estimating dependency structure as a hidden variable. NIPS 11. \n\n[9] Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons. \n\n[10] West, D. (1996). Introduction to Graph Theory. Prentice Hall. \n", "award": [], "sourceid": 1733, "authors": [{"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}, {"given_name": "Marina", "family_name": "Meila", "institution": null}, {"given_name": "Tony", "family_name": "Jebara", "institution": null}]}