{"title": "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 848, "abstract": null, "full_text": "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes\n\nAndrew Y. Ng, Computer Science Division, University of California, Berkeley, Berkeley, CA 94720\nMichael I. Jordan, C.S. Div. & Dept. of Stat., University of California, Berkeley, Berkeley, CA 94720\n\nAbstract\n\nWe compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widely held belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation, which is borne out in repeated experiments, that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.\n\n1 Introduction\n\nGenerative classifiers learn a model of the joint probability, p(x, y), of the inputs x and the label y, and make their predictions by using Bayes rule to calculate p(y|x) and then picking the most likely label y. Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels. 
There are several compelling reasons for using discriminative rather than generative classifiers, one of which, succinctly articulated by Vapnik [6], is that \"one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)].\" Indeed, leaving aside computational issues and matters such as handling missing data, the prevailing consensus seems to be that discriminative classifiers are almost always to be preferred to generative ones.\n\nAnother piece of prevailing folk wisdom is that the number of examples needed to fit a model is often roughly linear in the number of free parameters of the model. This has its theoretical basis in the observation that for \"many\" models, the VC dimension is roughly linear or at most some low-order polynomial in the number of parameters (see, e.g., [1, 3]), and it is known that sample complexity in the discriminative setting is linear in the VC dimension [6].\n\nIn this paper, we study empirically and theoretically the extent to which these beliefs are true. A parametric family of probabilistic models p(x, y) can be fit either to optimize the joint likelihood of the inputs and the labels, or fit to optimize the conditional likelihood p(y|x), or even fit to minimize the 0-1 training error obtained by thresholding p(y|x) to make predictions. Given a classifier h_Gen fit according to the first criterion, and a model h_Dis fit according to either the second or the third criterion (using the same parametric family of models), we call h_Gen and h_Dis a Generative-Discriminative pair. For example, if p(x|y) is Gaussian and p(y) is multinomial, then the corresponding Generative-Discriminative pair is Normal Discriminant Analysis and logistic regression. 
Similarly, for the case of discrete inputs it is also well known that the naive Bayes classifier and logistic regression form a Generative-Discriminative pair [4, 5].\n\nTo compare generative and discriminative learning, it seems natural to focus on such pairs. In this paper, we consider the naive Bayes model (for both discrete and continuous inputs) and its discriminative analog, logistic regression/linear classification, and show: (a) the generative model does indeed have a higher asymptotic error (as the number of training examples becomes large) than the discriminative model; but (b) the generative model may also approach its asymptotic error much faster than the discriminative model, possibly with a number of training examples that is only logarithmic, rather than linear, in the number of parameters. This suggests, and our empirical results strongly support, that as the number of training examples is increased, there can be two distinct regimes of performance: a first in which the generative model has already approached its asymptotic error and is thus doing better, and a second in which the discriminative model approaches its lower asymptotic error and does better.\n\n2 Preliminaries\n\nWe consider a binary classification task, and begin with the case of discrete data. Let X = {0, 1}^n be the n-dimensional input space, where we have assumed binary inputs for simplicity (the generalization offering no difficulties). Let the output labels be Y = {T, F}, and let there be a joint distribution D over X × Y from which a training set S = {x^(i), y^(i)}_{i=1}^m of m iid examples is drawn. The generative naive Bayes classifier uses S to calculate estimates p̂(x_i|y) and p̂(y) of the probabilities p(x_i|y) and p(y), as follows:\n\np̂(x_i = 1 | y = b) = (#_S{x_i = 1, y = b} + l) / (#_S{y = b} + 2l)    (1)\n\n(and similarly for p̂(y = b)), where #_S{·} counts the number of occurrences of an event in the training set S. 
Here, setting l = 0 corresponds to taking the empirical estimates of the probabilities, and l is more traditionally set to a positive value such as 1, which corresponds to using Laplace smoothing of the probabilities. To classify a test example x, the naive Bayes classifier h_Gen : X → Y predicts h_Gen(x) = T if and only if the following quantity is positive:\n\nl_Gen(x) = log[ (∏_{i=1}^n p̂(x_i | y = T)) p̂(y = T) / ( (∏_{i=1}^n p̂(x_i | y = F)) p̂(y = F) ) ] = Σ_{i=1}^n log( p̂(x_i | y = T) / p̂(x_i | y = F) ) + log( p̂(y = T) / p̂(y = F) )    (2)\n\nIn the case of continuous inputs, almost everything remains the same, except that we now assume X = [0, 1]^n, and let p(x_i | y = b) be parameterized as a univariate Gaussian distribution with parameters μ̂_{i|y=b} and σ̂_i² (note that the μ̂'s, but not the σ̂'s, depend on y). The parameters are fit via maximum likelihood, so for example μ̂_{i|y=b} is the empirical mean of the i-th coordinate of all the examples in the training set with label y = b. Note that this method is also equivalent to Normal Discriminant Analysis assuming diagonal covariance matrices. In the sequel, we also let μ_{i|y=b} = E[x_i | y = b] and σ_i² = E_y[Var(x_i | y)] be the \"true\" means and variances (regardless of whether the data are Gaussian or not).\n\nIn both the discrete and the continuous cases, it is well known that the discriminative analog of naive Bayes is logistic regression. This model has parameters [β, θ], and posits that p(y = T | x; β, θ) = 1/(1 + exp(−β^T x − θ)). Given a test example x, the discriminative logistic regression classifier h_Dis : X → Y predicts h_Dis(x) = T if and only if the linear discriminant function\n\nl_Dis(x) = Σ_{i=1}^n β_i x_i + θ    (3)\n\nis positive. 
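For concreteness, the smoothed estimates of Equation (1) and the discriminant of Equation (2) can be sketched as follows. This is a minimal illustration, not the authors' code, and the helper names (fit_naive_bayes, l_gen) are our own:

```python
import numpy as np

def fit_naive_bayes(X, y, l=1):
    # X: (m, n) binary feature matrix; y: (m,) boolean labels (True = T).
    # l is the smoothing parameter of Equation (1): l = 0 gives the raw
    # empirical frequencies, l = 1 gives Laplace smoothing.
    m, n = X.shape
    p_y = (y.sum() + l) / (m + 2 * l)                       # estimate of p(y = T)
    p_x_T = (X[y].sum(axis=0) + l) / (y.sum() + 2 * l)      # p(x_i = 1 | y = T)
    p_x_F = (X[~y].sum(axis=0) + l) / ((~y).sum() + 2 * l)  # p(x_i = 1 | y = F)
    return p_y, p_x_T, p_x_F

def l_gen(x, p_y, p_x_T, p_x_F):
    # Discriminant of Equation (2): sum of per-feature log-likelihood
    # ratios plus the log prior ratio; naive Bayes predicts T iff l_gen(x) > 0.
    lik_T = np.where(x == 1, p_x_T, 1 - p_x_T)
    lik_F = np.where(x == 1, p_x_F, 1 - p_x_F)
    return np.log(lik_T / lik_F).sum() + np.log(p_y / (1 - p_y))
```

Thresholding l_gen at zero on a test point reproduces the classifier h_Gen above.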
Being a discriminative model, the parameters [β, θ] can be fit either to maximize the conditional likelihood on the training set, Σ_{i=1}^m log p(y^(i) | x^(i); β, θ), or to minimize the 0-1 training error, Σ_{i=1}^m 1{h_Dis(x^(i)) ≠ y^(i)}, where 1{·} is the indicator function (1{True} = 1, 1{False} = 0). Insofar as the error metric is 0-1 classification error, we view the latter alternative as being more truly in the \"spirit\" of discriminative learning, though the former is also frequently used as a computationally efficient approximation to the latter. In this paper, we will largely ignore the difference between these two versions of discriminative learning and, with some abuse of terminology, will loosely use the term \"logistic regression\" to refer to either, though our formal analyses will focus on the latter method.\n\nFinally, let H be the family of all linear classifiers (maps from X to Y); and given a classifier h : X → Y, define its generalization error to be ε(h) = Pr_{(x,y)~D}[h(x) ≠ y].\n\n3 Analysis of algorithms\n\nWhen D is such that the two classes are far from linearly separable, neither logistic regression nor naive Bayes can possibly do well, since both are linear classifiers. Thus, to obtain non-trivial results, it is most interesting to compare the performance of these algorithms to their asymptotic errors (cf. the agnostic learning setting). More precisely, let h_Gen,∞ be the population version of the naive Bayes classifier; i.e. h_Gen,∞ is the naive Bayes classifier with parameters p̂(x|y) = p(x|y), p̂(y) = p(y). Similarly, let h_Dis,∞ be the population version of logistic regression. The following two propositions are then completely straightforward.\n\nProposition 1 Let h_Gen and h_Dis be any Generative-Discriminative pair of classifiers, and h_Gen,∞ and h_Dis,∞ be their asymptotic/population versions. Then¹ ε(h_Dis,∞) ≤ ε(h_Gen,∞). 
\n\nProposition 2 Let h_Dis be logistic regression in n dimensions. Then with high probability\n\nε(h_Dis) ≤ ε(h_Dis,∞) + O( √( (n/m) log (m/n) ) )\n\nThus, for ε(h_Dis) ≤ ε(h_Dis,∞) + ε₀ to hold with high probability (here, ε₀ > 0 is some fixed constant), it suffices to pick m = O(n).\n\nProposition 1 states that asymptotically, the error of the discriminative logistic regression is smaller than that of the generative naive Bayes. This is easily shown by observing that, since ε(h_Dis) converges to inf_{h∈H} ε(h) (where H is the class of all linear classifiers), it must therefore be asymptotically no worse than the linear classifier picked by naive Bayes. This proposition also provides a basis for what seems to be the widely held belief that discriminative classifiers are better than generative ones.\n\nProposition 2 is another standard result, a straightforward application of Vapnik's uniform convergence bounds to logistic regression, using the fact that H has VC dimension n. The second part of the proposition states that the sample complexity of discriminative learning, that is, the number of examples needed to approach the asymptotic error, is at most on the order of n. Note that the worst-case sample complexity is also lower-bounded by order n [6].\n\n¹ Under a technical assumption (that is true for most classifiers, including logistic regression) that the family of possible classifiers h_Dis (in the case of logistic regression, this is H) has finite VC dimension.\n\nThe picture for discriminative learning is thus fairly well understood: the error converges to that of the best linear classifier, and convergence occurs after on the order of n examples. How about generative learning, specifically the case of the naive Bayes classifier? We begin with the following lemma.\n\nLemma 3 Let any ε, δ > 0 and any l ≥ 0 be fixed. 
Assume that for some fixed p₀ > 0, we have that p₀ ≤ p(y = T) ≤ 1 − p₀. Let m = O((1/ε²) log(n/δ)). Then with probability at least 1 − δ:\n\n1. In the case of discrete inputs, |p̂(x_i | y = b) − p(x_i | y = b)| ≤ ε and |p̂(y = b) − p(y = b)| ≤ ε, for all i = 1, ..., n and b ∈ Y.\n\n2. In the case of continuous inputs, |μ̂_{i|y=b} − μ_{i|y=b}| ≤ ε, |σ̂_i² − σ_i²| ≤ ε, and |p̂(y = b) − p(y = b)| ≤ ε, for all i = 1, ..., n and b ∈ Y.\n\nProof (sketch). Consider the discrete case, and let l = 0 for now. Let ε ≤ p₀/2. By the Chernoff bound, with probability at least 1 − δ₁ = 1 − 2 exp(−2ε²m), the fraction of positive examples will be within ε of p(y = T), which implies |p̂(y = b) − p(y = b)| ≤ ε, and we have at least γm positive and γm negative examples, where γ = p₀ − ε = Ω(1). So by the Chernoff bound again, for a specific i, b, the chance that |p̂(x_i | y = b) − p(x_i | y = b)| > ε is at most δ₂ = 2 exp(−2ε²γm). Since there are 2n such probabilities, the overall chance of error, by the union bound, is at most δ₁ + 2nδ₂. Substituting in δ₁'s and δ₂'s definitions, we see that to guarantee δ₁ + 2nδ₂ ≤ δ, it suffices that m is as stated. Lastly, smoothing (l > 0) adds at most a small, O(l/m) perturbation to these probabilities; using the same argument as above with (say) ε/2 instead of ε, and arguing that this O(l/m) perturbation is at most ε/2 (which it is, since m is at least of order 1/ε²), again gives the result. The result for the continuous case is proved similarly using a Chernoff-bound-based argument (and the assumption that x_i ∈ [0, 1]). □\n\nThus, with a number of samples that is only logarithmic, rather than linear, in n, the parameters of the generative classifier h_Gen are uniformly close to their asymptotic values in h_Gen,∞. 
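The logarithmic dependence on n in Lemma 3 is easy to see numerically. The following sketch is our own illustration with arbitrary constants, not part of the paper; it measures the worst-case deviation of empirical Bernoulli feature frequencies, the quantity the Chernoff-plus-union-bound argument controls:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_param_error(n, m, p=0.3):
    # Draw m iid examples of n independent Bernoulli(p) features and return
    # the largest deviation of any empirical frequency from the truth.
    X = rng.random((m, n)) < p
    return np.abs(X.mean(axis=0) - p).max()

# The union bound over n features costs only a log n factor in m:
print(max_param_error(n=100, m=2000))    # worst error over 100 estimates
print(max_param_error(n=10000, m=2000))  # 100x the features, similar error
print(max_param_error(n=10000, m=8000))  # 4x the data, smaller error
```

Multiplying the number of features by 100 barely moves the maximum error, whereas increasing m shrinks it, mirroring the m = O((1/ε²) log(n/δ)) rate.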
It is tempting to conclude therefore that ε(h_Gen), the error of the generative naive Bayes classifier, also converges to its asymptotic value of ε(h_Gen,∞) after this many examples, implying that only O(log n) examples are required to fit a naive Bayes model. We will shortly establish some simple conditions under which this intuition is indeed correct. Note that this implies that, even though naive Bayes converges to a higher asymptotic error ε(h_Gen,∞) compared to logistic regression's ε(h_Dis,∞), it may also approach it significantly faster: after O(log n), rather than O(n), training examples.\n\nOne way of showing that ε(h_Gen) approaches ε(h_Gen,∞) is by showing that the parameters' convergence implies that h_Gen is very likely to make the same predictions as h_Gen,∞. Recall that h_Gen makes its predictions by thresholding the discriminant function l_Gen defined in (2). Let l_Gen,∞ be the corresponding discriminant function used by h_Gen,∞. On every example on which both l_Gen and l_Gen,∞ fall on the same side of zero, h_Gen and h_Gen,∞ will make the same prediction. Moreover, as long as l_Gen,∞(x) is, with fairly high probability, far from zero, then l_Gen(x), being a small perturbation of l_Gen,∞(x), will also usually be on the same side of zero as l_Gen,∞(x).\n\nTheorem 4 Define G(τ) = Pr_{(x,y)~D}[(l_Gen,∞(x) ∈ [0, τn] ∧ y = T) ∨ (l_Gen,∞(x) ∈ [−τn, 0] ∧ y = F)]. Assume that for some fixed p₀ > 0, we have p₀ ≤ p(y = T) ≤ 1 − p₀, and that either p₀ ≤ p(x_i = 1 | y = b) ≤ 1 − p₀ for all i, b (in the case of discrete inputs), or σ_i² ≥ p₀ (in the continuous case). Then with high probability,\n\nε(h_Gen) ≤ ε(h_Gen,∞) + G( O( √( (1/m) log n ) ) )    (4)\n\nProof (sketch). ε(h_Gen) − ε(h_Gen,∞) is upper-bounded by the chance that h_Gen,∞ correctly classifies a randomly chosen example, but h_Gen misclassifies it. 
Lemma 3 ensures that, with high probability, all the parameters of h_Gen are within O(√((log n)/m)) of those of h_Gen,∞. This in turn implies that every one of the n + 1 terms in the sum in l_Gen (as in Equation 2) is within O(√((log n)/m)) of the corresponding term in l_Gen,∞, and hence that |l_Gen(x) − l_Gen,∞(x)| ≤ O(n√((log n)/m)). Letting τ = O(√((log n)/m)), we therefore see that it is possible for h_Gen,∞ to be correct and h_Gen to be wrong on an example (x, y) only if y = T and l_Gen,∞(x) ∈ [0, τn] (so that it is possible that l_Gen(x) ≤ 0 while l_Gen,∞(x) ≥ 0), or if y = F and l_Gen,∞(x) ∈ [−τn, 0]. The probability of this is exactly G(τ), which therefore upper-bounds ε(h_Gen) − ε(h_Gen,∞). □\n\nThe key quantity in the Theorem is G(τ), which must be small when τ is small in order for the bound to be non-trivial. Note that G(τ) is upper-bounded by Pr_x[l_Gen,∞(x) ∈ [−τn, τn]], the chance that l_Gen,∞(x) (a random variable whose distribution is induced by x ~ D) falls near zero. To gain intuition about the scaling of these random variables, consider the following:\n\nProposition 5 Suppose that, for at least an Ω(1) fraction of the features i (i = 1, ..., n), it holds true that |p(x_i = 1 | y = T) − p(x_i = 1 | y = F)| ≥ γ for some fixed γ > 0 (or |μ_{i|y=T} − μ_{i|y=F}| ≥ γ in the case of continuous inputs). Then E[l_Gen,∞(x) | y = T] = Ω(n), and −E[l_Gen,∞(x) | y = F] = Ω(n).\n\nThus, as long as the class label gives information about an Ω(1) fraction of the features (or, less formally, as long as most of the features are \"relevant\" to the class label), the expected value of |l_Gen,∞(x)| will be Ω(n). 
The proposition is easily proved by showing that, conditioned on (say) the event y = T, each of the terms in the summation in l_Gen,∞(x) (as in Equation (2), but with the p̂'s replaced by p's) has non-negative expectation (by non-negativity of the KL-divergence), and moreover an Ω(1) fraction of them have expectation bounded away from zero.\n\nProposition 5 guarantees that |l_Gen,∞(x)| has large expectation, though what we want in order to bound G is actually slightly stronger, namely that the random variable |l_Gen,∞(x)| further be large/far from zero with high probability. There are several ways of deriving sufficient conditions for ensuring that G is small. One way of obtaining a loose bound is via the Chebyshev inequality. For the rest of this discussion, let us for simplicity implicitly condition on the event that a test example x has label T. The Chebyshev inequality implies that Pr[l_Gen,∞(x) ≤ E[l_Gen,∞(x)] − t] ≤ Var(l_Gen,∞(x))/t². Now, l_Gen,∞(x) is the sum of n random variables (ignoring the term involving the priors p(y)). If (still conditioned on y) these n random variables are independent (i.e. if the \"naive Bayes assumption,\" that the x_i's are conditionally independent given y, holds), then its variance is O(n); even if the n random variables were not completely independent, the variance may still be not much larger than O(n) (and may even be smaller, depending on the signs of the correlations), and is at most O(n²). So, if E[l_Gen,∞(x) | y = T] = αn for some α > 0 (as would be guaranteed by Proposition 5), then setting t = (α − τ)n, Chebyshev's inequality gives Pr[l_Gen,∞(x) ≤ τn] ≤ O(1/((α − τ)²n^η)) (for τ < α), where η = 0 in the worst case, and η = 1 in the independent case. This thus gives a bound for G(τ), but note that it will frequently be very loose. 
Indeed, in the unrealistic case in which the naive Bayes assumption really holds, we can obtain the much stronger bound (via the Chernoff bound) G(τ) ≤ exp(−O((α − τ)²n)), which is exponentially small in n. In the continuous case, if l_Gen,∞(x) has a density that, within some small interval around zero, is uniformly bounded by O(1/n), then we also have G(τ) = O(τ). In any case, we also have the following Corollary to Theorem 4.\n\nCorollary 6 Let the conditions of Theorem 4 hold, and suppose that G(τ) ≤ ε₀/2 + F(τ) for some function F(τ) (independent of n) that satisfies F(τ) → 0 as τ → 0, and some fixed ε₀ > 0. Then for ε(h_Gen) ≤ ε(h_Gen,∞) + ε₀ to hold with high probability, it suffices to pick m = O(log n).\n\n[Figure 1: Results of 15 experiments on datasets from the UCI Machine Learning repository. Plots are of generalization error vs. m (averaged over 1000 random train/test splits). Dashed line is logistic regression; solid line is naive Bayes. The panels (plot data not recoverable here) include pima, adult, boston (predict if > median price), optdigits (0's and 1's), optdigits (2's and 3's), ionosphere, liver disorders (continuous); and adult, promoters, lymphography, breast cancer, lenses (predict hard vs. soft), sick, voting records (discrete).]\n\nNote that the previous discussion implies that the preconditions of the Corollary do indeed hold in the case that the naive Bayes assumption (and Proposition 5's) holds, for any constant ε₀ so long as n is large enough that ε₀ ≥ exp(−O(α²n)) (and similarly for the bounded Var(l_Gen,∞(x)) case, with the more restrictive ε₀ ≥ O(1/(α²n^η))). This also means that either of these (the latter also requiring η > 0) is a sufficient condition for the asymptotic sample complexity to be O(log n).\n\n4 Experiments\n\nThe results of the previous section imply that even though the discriminative logistic regression algorithm has a lower asymptotic error, the generative naive Bayes classifier may also converge more quickly to its (higher) asymptotic error. 
Thus, as the number of training examples m is increased, one would expect generative naive Bayes to initially do better, but discriminative logistic regression to eventually catch up to, and quite likely overtake, the performance of naive Bayes.\n\nTo test these predictions, we performed experiments on 15 datasets, 8 with continuous inputs, 7 with discrete inputs, from the UCI Machine Learning repository.² The results of these experiments are shown in Figure 1. We find that the theoretical predictions are borne out surprisingly well. There are a few cases in which logistic regression's performance did not catch up to that of naive Bayes, but this is observed primarily in particularly small datasets, in which m presumably cannot grow large enough for us to observe the expected dominance of logistic regression in the large-m limit.\n\n5 Discussion\n\nEfron [2] also analyzed logistic regression and Normal Discriminant Analysis (for continuous inputs), and concluded that the former was only asymptotically very slightly (1/3 to 1/2 times) less statistically efficient. This is in marked contrast to our results, and one key difference is that, rather than assuming p(x|y) is Gaussian with a diagonal covariance matrix (as we did), Efron considered the case where p(x|y) is modeled as Gaussian with a full covariance matrix. In this setting, the estimated covariance matrix is singular if we have fewer than linear in n training examples, so it is no surprise that Normal Discriminant Analysis cannot learn much faster than logistic regression here. A second important difference is that Efron considered only the special case in which p(x|y) is truly Gaussian. Such an asymptotic comparison is not very useful in the general case, since the only possible conclusion, if ε(h_Dis,∞) < ε(h_Gen,∞), is that logistic regression is the superior algorithm. 
In contrast, as we saw previously, it is in the non-asymptotic case that the most interesting \"two-regime\" behavior is observed.\n\nPractical classification algorithms generally involve some form of regularization; in particular, logistic regression can often be improved upon in practice by techniques such as shrinking the parameters via an L1 constraint, imposing a margin constraint in the separable case, or various forms of averaging. Such regularization techniques can be viewed as changing the model family, however, and as such they are largely orthogonal to the analysis in this paper, which is based on examining particularly clear cases of Generative-Discriminative model pairings. By developing a clearer understanding of the conditions under which pure generative and discriminative approaches are most successful, we should be better able to design hybrid classifiers that enjoy the best properties of either across a wider range of conditions.\n\n² To maximize the consistency with the theoretical discussion, these experiments avoided discrete/continuous hybrids by considering only the discrete or only the continuous-valued inputs for a dataset where necessary. Train/test splits were random subject to there being at least one example of each class in the training set, and continuous-valued inputs were also rescaled to [0, 1] if necessary. In the case of linearly separable datasets, logistic regression makes no distinction between the many possible separating planes. In this setting we used an MCMC sampler to pick a classifier randomly from them (i.e., so the errors reported are empirical averages over the separating hyperplanes). Our implementation of Normal Discriminant Analysis also used the (standard) trick of adding ε to the diagonal of the covariance matrix to ensure invertibility, and for naive Bayes we used l = 1. 
\n\nFinally, while our discussion has focused on naive Bayes and logistic regression, it is straightforward to extend the analyses to several other models, including Generative-Discriminative pairs generated by using a fixed-structure, bounded fan-in Bayesian network model for p(x|y) (of which naive Bayes is a special case).\n\nAcknowledgments\n\nWe thank Andrew McCallum for helpful conversations. A. Ng is supported by a Microsoft Research fellowship. This work was also supported by a grant from Intel Corporation, NSF grant IIS-9988642, and ONR MURI N00014-00-1-0637.\n\nReferences\n\n[1] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.\n\n[2] B. Efron. The efficiency of logistic regression compared to Normal Discriminant Analysis. Journal of the American Statistical Association, 70:892-898, 1975.\n\n[3] P. Goldberg and M. Jerrum. Bounding the VC dimension of concept classes parameterized by real numbers. Machine Learning, 18:131-148, 1995.\n\n[4] G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992.\n\n[5] Y. D. Rubinstein and T. Hastie. Discriminative vs. informative learning. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 49-53. AAAI Press, 1997.\n\n[6] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.", "award": [], "sourceid": 2020, "authors": [{"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}