{"title": "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes", "book": "Advances in Neural Information Processing Systems", "page_first": 841, "page_last": 848, "abstract": null, "full_text": "On  Discriminative vs.  Generative \nclassifiers:  A  comparison of logistic \n\nregression and  naive  Bayes \n\nAndrew Y.  Ng \n\nComputer Science  Division \n\nMichael I. Jordan \n\nC.S.  Div.  &  Dept.  of Stat. \n\nUniversity of California,  Berkeley \n\nUniversity of California, Berkeley \n\nBerkeley,  CA  94720 \n\nBerkeley,  CA 94720 \n\nAbstract \n\nWe  compare discriminative  and  generative learning as  typified  by \nlogistic regression and naive Bayes.  We show,  contrary to a widely(cid:173)\nheld  belief  that  discriminative  classifiers  are  almost  always  to  be \npreferred,  that  there  can  often  be  two  distinct  regimes  of  per(cid:173)\nformance  as  the  training  set  size  is  increased,  one  in  which  each \nalgorithm  does  better.  This  stems  from  the  observation- which \nis  borne  out  in  repeated  experiments-\nthat  while  discriminative \nlearning has lower asymptotic error, a generative classifier may also \napproach its  (higher)  asymptotic error  much faster. \n\n1 \n\nIntroduction \n\nGenerative classifiers learn a  model  of the joint probability, p( x, y),  of the inputs x \nand the label y,  and make their predictions by using Bayes rules to calculate p(ylx), \nand then picking  the most likely  label  y.  Discriminative classifiers  model  the  pos(cid:173)\nterior p(ylx)  directly, or learn a direct map from inputs x  to the class labels.  There \nare several compelling reasons for  using discriminative rather than generative clas(cid:173)\nsifiers,  one of which,  succinctly articulated by Vapnik  [6],  is  that  \"one should solve \nthe  [classification]  problem  directly  and never  solve  a  more  general problem  as  an \nintermediate step  [such  as  modeling p(xly)].\"  Indeed,  leaving  aside  computational \nissues  and  matters  such  as  handling  missing  data,  the  prevailing  consensus  seems \nto be that discriminative classifiers are almost always to be preferred to generative \nones. \nAnother piece  of prevailing folk  wisdom is  that the number of examples  needed to \nfit  a  model  is  often  roughly  linear  in  the  number  of  free  parameters  of a  model. \nThis  has  its  theoretical  basis  in  the  observation that  for  \"many\"  models,  the  VC \ndimension  is  roughly  linear  or at  most  some  low-order  polynomial  in  the  number \nof  parameters  (see,  e.g.,  [1,  3]),  and  it  is  known  that  sample  complexity  in  the \ndiscriminative  setting  is  linear in the VC  dimension  [6]. \nIn  this  paper,  we  study  empirically  and  theoretically  the  extent  to  which  these \nbeliefs are true.  A parametric family of probabilistic models p(x, y)  can be fit  either \nto  optimize  the joint likelihood  of the inputs  and the labels,  or fit  to optimize  the \nconditional likelihood p(ylx), or even fit  to minimize the 0-1  training error obtained \n\n\fby  thresholding  p(ylx)  to  make  predictions.  Given  a  classifier  hGen  fit  according \nto  the  first  criterion,  and  a  model  hDis  fit  according  to  either  the  second  or  the \nthird  criterion  (using  the  same  parametric  family  of  models) ,  we  call  hGen  and \nhDis  a  Generative-Discriminative  pair.  
For example, if p(x|y) is Gaussian and p(y) is multinomial, then the corresponding Generative-Discriminative pair is Normal Discriminant Analysis and logistic regression. Similarly, for the case of discrete inputs it is also well known that the naive Bayes classifier and logistic regression form a Generative-Discriminative pair [4, 5].

To compare generative and discriminative learning, it seems natural to focus on such pairs. In this paper, we consider the naive Bayes model (for both discrete and continuous inputs) and its discriminative analog, logistic regression/linear classification, and show: (a) the generative model does indeed have a higher asymptotic error (as the number of training examples becomes large) than the discriminative model, but (b) the generative model may also approach its asymptotic error much faster than the discriminative model, possibly with a number of training examples that is only logarithmic, rather than linear, in the number of parameters. This suggests, and our empirical results strongly support, that as the number of training examples is increased there can be two distinct regimes of performance: a first in which the generative model has already approached its asymptotic error and is thus doing better, and a second in which the discriminative model approaches its lower asymptotic error and does better.

2 Preliminaries

We consider a binary classification task, and begin with the case of discrete data. Let X = {0,1}^n be the n-dimensional input space, where we have assumed binary inputs for simplicity (the generalization offering no difficulties). Let the output labels be Y = {T, F}, and let there be a joint distribution D over X × Y from which a training set S = {(x^(i), y^(i))}_{i=1}^m of m iid examples is drawn. The generative naive Bayes classifier uses S to calculate estimates \hat{p}(x_i|y) and \hat{p}(y) of the probabilities p(x_i|y) and p(y), as follows:

    \hat{p}(x_i = 1 \mid y = b) = \frac{\#_S\{x_i = 1,\, y = b\} + l}{\#_S\{y = b\} + 2l}    (1)

(and similarly for \hat{p}(y = b)), where \#_S\{\cdot\} counts the number of occurrences of an event in the training set S. Here, setting l = 0 corresponds to taking the empirical estimates of the probabilities, and l is more traditionally set to a positive value such as 1, which corresponds to using Laplace smoothing of the probabilities. To classify a test example x, the naive Bayes classifier h_Gen : X -> Y predicts h_Gen(x) = T if and only if the following quantity is positive:

    l_{Gen}(x) = \log \frac{\left(\prod_{i=1}^n \hat{p}(x_i \mid y = T)\right) \hat{p}(y = T)}{\left(\prod_{i=1}^n \hat{p}(x_i \mid y = F)\right) \hat{p}(y = F)} = \sum_{i=1}^n \log \frac{\hat{p}(x_i \mid y = T)}{\hat{p}(x_i \mid y = F)} + \log \frac{\hat{p}(y = T)}{\hat{p}(y = F)}.    (2)
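For concreteness, here is a minimal sketch (ours, not code from the paper) of the smoothed estimates of Equation (1) and the discriminant of Equation (2), in Python with numpy; the names fit_naive_bayes and l_gen are our own.

```python
# A minimal sketch (ours) of the discrete naive Bayes classifier of Eqs. (1)-(2).
import numpy as np

def fit_naive_bayes(X, y, l=1):
    """X: (m, n) array of 0/1 features; y: (m,) boolean labels (True = T); l: smoothing."""
    m, n = X.shape
    p_y = (y.sum() + l) / (m + 2 * l)                      # \hat{p}(y = T), smoothed
    # Eq. (1): \hat{p}(x_i = 1 | y = b) = (#{x_i = 1, y = b} + l) / (#{y = b} + 2l)
    p_x_T = (X[y].sum(axis=0) + l) / (y.sum() + 2 * l)
    p_x_F = (X[~y].sum(axis=0) + l) / ((~y).sum() + 2 * l)
    return p_y, p_x_T, p_x_F

def l_gen(x, p_y, p_x_T, p_x_F):
    """Discriminant of Eq. (2); the classifier predicts T iff l_gen(x) > 0."""
    log_ratios = np.where(x == 1,
                          np.log(p_x_T) - np.log(p_x_F),
                          np.log(1 - p_x_T) - np.log(1 - p_x_F))
    return log_ratios.sum() + np.log(p_y) - np.log(1 - p_y)
```

With l = 1 this is the Laplace-smoothed classifier used in the experiments of Section 4.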
In the case of continuous inputs, almost everything remains the same, except that we now assume X = [0,1]^n, and let p(x_i|y = b) be parameterized as a univariate Gaussian distribution with parameters \hat{\mu}_{i|y=b} and \hat{\sigma}_i^2 (note that the \mu's, but not the \sigma's, depend on y). The parameters are fit via maximum likelihood, so for example \hat{\mu}_{i|y=b} is the empirical mean of the i-th coordinate of all the examples in the training set with label y = b. Note that this method is also equivalent to Normal Discriminant Analysis assuming diagonal covariance matrices. In the sequel, we also let \mu_{i|y=b} = E[x_i | y = b] and \sigma_i^2 = E_y[Var(x_i|y)] be the "true" means and variances (regardless of whether the data are Gaussian or not).

In both the discrete and the continuous cases, it is well known that the discriminative analog of naive Bayes is logistic regression. This model has parameters [β, θ], and posits that p(y = T | x; β, θ) = 1/(1 + exp(-β^T x - θ)). Given a test example x, the discriminative logistic regression classifier h_Dis : X -> Y predicts h_Dis(x) = T if and only if the linear discriminant function

    l_{Dis}(x) = \sum_{i=1}^n \beta_i x_i + \theta    (3)

is positive. Being a discriminative model, the parameters [β, θ] can be fit either to maximize the conditional likelihood on the training set, \sum_{i=1}^m \log p(y^{(i)} | x^{(i)}; \beta, \theta), or to minimize the 0-1 training error, \sum_{i=1}^m 1\{h_Dis(x^{(i)}) \ne y^{(i)}\}, where 1{·} is the indicator function (1{True} = 1, 1{False} = 0). Insofar as the error metric is 0-1 classification error, we view the latter alternative as being more truly in the "spirit" of discriminative learning, though the former is also frequently used as a computationally efficient approximation to the latter. In this paper, we will largely ignore the difference between these two versions of discriminative learning and, with some abuse of terminology, will loosely use the term "logistic regression" to refer to either, though our formal analyses will focus on the latter method.
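A correspondingly minimal sketch (again ours) of the first, conditional-likelihood version of the discriminative fit: batch gradient ascent on the conditional log-likelihood. The step size and iteration count are arbitrary illustrative choices.

```python
# A sketch (ours) of logistic regression fit by gradient ascent on the
# conditional log-likelihood; predicts T iff the discriminant of Eq. (3) > 0.
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=2000):
    """X: (m, n) array; y: (m,) array of 0/1 labels. Returns (beta, theta)."""
    m, n = X.shape
    beta, theta = np.zeros(n), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta + theta)))   # p(y = T | x; beta, theta)
        beta += lr * (X.T @ (y - p)) / m                # gradient of mean log-likelihood
        theta += lr * (y - p).mean()
    return beta, theta

def l_dis(x, beta, theta):
    """Linear discriminant of Eq. (3)."""
    return x @ beta + theta
```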
Finally, let H be the family of all linear classifiers (maps from X to Y); and given a classifier h : X -> Y, define its generalization error to be ε(h) = Pr_{(x,y)~D}[h(x) ≠ y].

3 Analysis of algorithms

When D is such that the two classes are far from linearly separable, neither logistic regression nor naive Bayes can possibly do well, since both are linear classifiers. Thus, to obtain non-trivial results, it is most interesting to compare the performance of these algorithms to their asymptotic errors (cf. the agnostic learning setting). More precisely, let h_Gen,∞ be the population version of the naive Bayes classifier; i.e., h_Gen,∞ is the naive Bayes classifier with parameters \hat{p}(x|y) = p(x|y), \hat{p}(y) = p(y). Similarly, let h_Dis,∞ be the population version of logistic regression. The following two propositions are then completely straightforward.

Proposition 1  Let h_Gen and h_Dis be any generative-discriminative pair of classifiers, and h_Gen,∞ and h_Dis,∞ be their asymptotic/population versions. Then(1) ε(h_Dis,∞) ≤ ε(h_Gen,∞).

Proposition 2  Let h_Dis be logistic regression in n dimensions. Then with high probability

    \varepsilon(h_{Dis}) \le \varepsilon(h_{Dis,\infty}) + O\left(\sqrt{\frac{n}{m}\,\log\frac{m}{n}}\right).

Thus, for ε(h_Dis) ≤ ε(h_Dis,∞) + ε₀ to hold with high probability (here, ε₀ > 0 is some fixed constant), it suffices to pick m = O(n).

(1) Under a technical assumption (true for most classifiers, including logistic regression) that the family of possible classifiers h_Dis (in the case of logistic regression, this is H) has finite VC dimension.

Proposition 1 states that asymptotically, the error of the discriminative logistic regression is smaller than that of the generative naive Bayes. This is easily shown by observing that, since ε(h_Dis) converges to inf_{h∈H} ε(h) (where H is the class of all linear classifiers), it must be asymptotically no worse than the linear classifier picked by naive Bayes. This proposition also provides a basis for what seems to be the widely held belief that discriminative classifiers are better than generative ones.

Proposition 2 is another standard result, and is a straightforward application of Vapnik's uniform convergence bounds to logistic regression, using the fact that H has VC dimension n. The second part of the proposition states that the sample complexity of discriminative learning, that is, the number of examples needed to approach the asymptotic error, is at most on the order of n. Note that the worst-case sample complexity is also lower-bounded by order n [6].

The picture for discriminative learning is thus fairly well understood: the error converges to that of the best linear classifier, and convergence occurs after on the order of n examples. How about generative learning, specifically the case of the naive Bayes classifier? We begin with the following lemma.

Lemma 3  Let any ε₁, δ > 0 and any l ≥ 0 be fixed. Assume that for some fixed p₀ > 0, we have p₀ ≤ p(y = T) ≤ 1 - p₀. Let m = O((1/ε₁²) log(n/δ)). Then with probability at least 1 - δ:

1. In the case of discrete inputs, |\hat{p}(x_i|y = b) - p(x_i|y = b)| ≤ ε₁ and |\hat{p}(y = b) - p(y = b)| ≤ ε₁, for all i = 1, ..., n and b ∈ Y.

2. In the case of continuous inputs, |\hat{\mu}_{i|y=b} - \mu_{i|y=b}| ≤ ε₁, |\hat{\sigma}_i² - \sigma_i²| ≤ ε₁, and |\hat{p}(y = b) - p(y = b)| ≤ ε₁, for all i = 1, ..., n and b ∈ Y.

Proof (sketch). Consider the discrete case, and let l = 0 for now. Let ε₁ ≤ p₀/2. By the Chernoff bound, with probability at least 1 - δ₁ = 1 - 2 exp(-2ε₁²m), the fraction of positive examples will be within ε₁ of p(y = T), which implies |\hat{p}(y = b) - p(y = b)| ≤ ε₁, and we have at least γm positive and γm negative examples, where γ = p₀ - ε₁ = Θ(1). So by the Chernoff bound again, for a specific i, b, the chance that |\hat{p}(x_i|y = b) - p(x_i|y = b)| > ε₁ is at most δ₂ = 2 exp(-2ε₁²γm). Since there are 2n such probabilities, the overall chance of error, by the union bound, is at most δ₁ + 2nδ₂. Substituting in δ₁'s and δ₂'s definitions, we see that to guarantee δ₁ + 2nδ₂ ≤ δ, it suffices that m is as stated. Lastly, smoothing (l > 0) adds at most a small, O(l/m) perturbation to these probabilities; using the same argument as above with (say) ε₁/2 instead of ε₁, and arguing that this O(l/m) perturbation is at most ε₁/2 (which it is, as m is at least of order 1/ε₁²), again gives the result. The result for the continuous case is proved similarly using a Chernoff-bound-based argument (and the assumption that x_i ∈ [0,1]). □
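The logarithmic dependence on n in Lemma 3 is easy to check numerically. In the following small simulation (ours; the constant 200 and the uniform draws are arbitrary illustrative choices), m grows only like log n, yet the worst-case error over all n empirical estimates stays roughly flat:

```python
# A numerical check (ours) of Lemma 3: with m = O(log n) samples, the maximum
# deviation over all n estimated probabilities remains roughly constant in n.
import numpy as np

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000]:
    m = int(200 * np.log(n))             # m grows only logarithmically in n
    p = rng.uniform(0.2, 0.8, size=n)    # true conditional probabilities p(x_i = 1 | y)
    X = rng.random((m, n)) < p           # m iid samples of the n binary features
    p_hat = X.mean(axis=0)               # empirical (l = 0) estimates
    print(n, m, np.abs(p_hat - p).max()) # max-norm error: roughly flat across n
```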
Thus, with a number of samples that is only logarithmic, rather than linear, in n, the parameters of the generative classifier h_Gen are uniformly close to their asymptotic values in h_Gen,∞. It is tempting to conclude that ε(h_Gen), the error of the generative naive Bayes classifier, therefore also converges to its asymptotic value ε(h_Gen,∞) after this many examples, implying that only O(log n) examples are required to fit a naive Bayes model. We will shortly establish some simple conditions under which this intuition is indeed correct. Note that this implies that, even though naive Bayes converges to a higher asymptotic error ε(h_Gen,∞) than logistic regression's ε(h_Dis,∞), it may also approach it significantly faster: after O(log n), rather than O(n), training examples.

One way of showing that ε(h_Gen) approaches ε(h_Gen,∞) is by showing that the parameters' convergence implies that h_Gen is very likely to make the same predictions as h_Gen,∞. Recall that h_Gen makes its predictions by thresholding the discriminant function l_Gen defined in (2). Let l_Gen,∞ be the corresponding discriminant function used by h_Gen,∞. On every example on which both l_Gen and l_Gen,∞ fall on the same side of zero, h_Gen and h_Gen,∞ make the same prediction. Moreover, as long as l_Gen,∞(x) is, with fairly high probability, far from zero, then l_Gen(x), being a small perturbation of l_Gen,∞(x), will also usually be on the same side of zero as l_Gen,∞(x).

Theorem 4  Define G(τ) = Pr_{(x,y)~D}[(l_Gen,∞(x) ∈ [0, τn] ∧ y = T) ∨ (l_Gen,∞(x) ∈ [-τn, 0] ∧ y = F)]. Assume that for some fixed p₀ > 0, we have p₀ ≤ p(y = T) ≤ 1 - p₀, and that either p₀ ≤ p(x_i = 1|y = b) ≤ 1 - p₀ for all i, b (in the case of discrete inputs), or σ_i² ≥ p₀ (in the continuous case). Then with high probability,

    \varepsilon(h_{Gen}) \le \varepsilon(h_{Gen,\infty}) + G\left(O\left(\sqrt{\frac{\log n}{m}}\right)\right).    (4)

Proof (sketch). ε(h_Gen) - ε(h_Gen,∞) is upper-bounded by the chance that h_Gen,∞ correctly classifies a randomly chosen example but h_Gen misclassifies it. Lemma 3 ensures that, with high probability, all the parameters of h_Gen are within O(√((log n)/m)) of those of h_Gen,∞. This in turn implies that every one of the n + 1 terms in the sum in l_Gen (as in Equation (2)) is within O(√((log n)/m)) of the corresponding term in l_Gen,∞, and hence that |l_Gen(x) - l_Gen,∞(x)| ≤ O(n√((log n)/m)). Letting τ = O(√((log n)/m)), we therefore see that it is possible for h_Gen,∞ to be correct and h_Gen to be wrong on an example (x, y) only if y = T and l_Gen,∞(x) ∈ [0, τn] (so that it is possible that l_Gen(x) ≤ 0), or if y = F and l_Gen,∞(x) ∈ [-τn, 0]. The probability of this is exactly G(τ), which therefore upper-bounds ε(h_Gen) - ε(h_Gen,∞). □

The key quantity in the Theorem is G(τ), which must be small when τ is small in order for the bound to be non-trivial. Note that G(τ) is upper-bounded by Pr_x[l_Gen,∞(x) ∈ [-τn, τn]], the chance that l_Gen,∞(x) (a random variable whose distribution is induced by x ~ D) falls near zero.
To gain intuition about the scaling of these random variables, consider the following:

Proposition 5  Suppose that, for at least an Ω(1) fraction of the features i (i = 1, ..., n), it holds that |p(x_i = 1|y = T) - p(x_i = 1|y = F)| ≥ γ for some fixed γ > 0 (or |\mu_{i|y=T} - \mu_{i|y=F}| ≥ γ in the case of continuous inputs). Then E[l_Gen,∞(x)|y = T] = Ω(n), and -E[l_Gen,∞(x)|y = F] = Ω(n).

Thus, as long as the class label gives information about an Ω(1) fraction of the features (or, less formally, as long as most of the features are "relevant" to the class label), the expected value of |l_Gen,∞(x)| will be Ω(n). The proposition is easily proved by showing that, conditioned on (say) the event y = T, each of the terms in the summation in l_Gen,∞(x) (as in Equation (2), but with the \hat{p}'s replaced by p's) has non-negative expectation (by non-negativity of KL divergence), and moreover an Ω(1) fraction of them have expectation bounded away from zero.

Proposition 5 guarantees that |l_Gen,∞(x)| has large expectation, though what we want in order to bound G is actually slightly stronger, namely that the random variable |l_Gen,∞(x)| further be large/far from zero with high probability. There are several ways of deriving sufficient conditions for ensuring that G is small. One way of obtaining a loose bound is via the Chebyshev inequality. For the rest of this discussion, let us for simplicity implicitly condition on the event that a test example x has label T. The Chebyshev inequality implies that Pr[l_Gen,∞(x) ≤ E[l_Gen,∞(x)] - t] ≤ Var(l_Gen,∞(x))/t². Now, l_Gen,∞(x) is the sum of n random variables (ignoring the term involving the priors p(y)). If (still conditioned on y) these n random variables are independent (i.e., if the "naive Bayes assumption," that the x_i's are conditionally independent given y, holds), then its variance is O(n); even if the n random variables were not completely independent, the variance may still be not much larger than O(n) (and may even be smaller, depending on the signs of the correlations), and is at most O(n²). So, if E[l_Gen,∞(x)|y = T] = αn (as would be guaranteed by Proposition 5) for some α > 0, then setting t = (α - τ)n, Chebyshev's inequality gives Pr[l_Gen,∞(x) ≤ τn] ≤ O(1/((α - τ)²n^η)) for τ < α, where η = 0 in the worst case, and η = 1 in the independent case. This gives a bound for G(τ), but note that it will frequently be very loose. Indeed, in the unrealistic case in which the naive Bayes assumption really holds, we can obtain the much stronger bound (via the Chernoff bound) G(τ) ≤ exp(-Ω((α - τ)²n)), which is exponentially small in n. In the continuous case, if l_Gen,∞(x) has a density that, within some small interval around zero, is uniformly bounded by O(1/n), then we also have G(τ) = O(τ).
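This concentration is easy to see in simulation. The sketch below (ours; the uniform draws for the conditional probabilities are an arbitrary choice) samples x with the naive Bayes assumption holding and y = T, and estimates G(τ) empirically; the observed fractions are tiny even for moderate n, as the Chernoff-bound argument predicts.

```python
# A simulation (ours) of the margin l_Gen,oo(x) when the naive Bayes assumption
# holds: conditioned on y = T, the discriminant is a sum of n independent
# log-ratio terms, so it concentrates at Theta(n) and G(tau) is exponentially small.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 100000
p_T = rng.uniform(0.2, 0.8, size=n)   # p(x_i = 1 | y = T)
p_F = rng.uniform(0.2, 0.8, size=n)   # p(x_i = 1 | y = F)
X = rng.random((trials, n)) < p_T     # draw x from the class y = T
terms = np.where(X, np.log(p_T / p_F), np.log((1 - p_T) / (1 - p_F)))
margin = terms.sum(axis=1)            # l_Gen,oo(x), ignoring the prior term
for tau in [0.01, 0.05, 0.1]:
    print(tau, (margin <= tau * n).mean())  # empirical estimate of G(tau) given y = T
```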
In any case, we also have the following Corollary to Theorem 4.

Corollary 6  Let the conditions of Theorem 4 hold, and suppose that G(τ) ≤ ε₀/2 + F(τ) for some function F(τ) (independent of n) that satisfies F(τ) → 0 as τ → 0, and some fixed ε₀ > 0. Then for ε(h_Gen) ≤ ε(h_Gen,∞) + ε₀ to hold with high probability, it suffices to pick m = O(log n).

Note that the previous discussion implies that the preconditions of the Corollary do indeed hold in the case that the naive Bayes assumption and Proposition 5's assumption hold, for any constant ε₀ so long as n is large enough that ε₀ ≥ exp(-Ω(α²n)) (and similarly for the bounded Var(l_Gen,∞(x)) case, with the more restrictive ε₀ ≥ O(1/(α²n^η))). This also means that either of these (the latter also requiring η > 0) is a sufficient condition for the asymptotic sample complexity to be O(log n).

[Figure 1: Results of 15 experiments on datasets from the UCI Machine Learning repository. Plots show generalization error vs. m, averaged over 1000 random train/test splits; the dashed line is logistic regression, the solid line is naive Bayes. Continuous-input panels: pima, adult, boston (predict if > median price), optdigits (0's vs. 1's), optdigits (2's vs. 3's), ionosphere, liver disorders. Discrete-input panels: adult, promoters, lymphography, breast cancer, lenses (predict hard vs. soft), sick, voting records.]

4 Experiments

The results of the previous section imply that even though the discriminative logistic regression algorithm has a lower asymptotic error, the generative naive Bayes classifier may also converge more quickly to its (higher) asymptotic error. Thus, as the number of training examples m is increased, one would expect generative naive Bayes to initially do better, but discriminative logistic regression to eventually catch up to, and quite likely overtake, the performance of naive Bayes.
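The following sketch (ours) illustrates the learning-curve protocol, with scikit-learn's implementations standing in for ours and a synthetic binary dataset standing in for the UCI sets; note that scikit-learn's logistic regression applies mild L2 regularization by default, a small departure from the plain model analyzed above.

```python
# A sketch (ours) of the learning-curve comparison: test error vs. training set
# size m, averaged over random train/test splits, for naive Bayes (Laplace
# smoothing, alpha = 1) and logistic regression on synthetic binary data.
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, N = 30, 2000
y = rng.random(N) < 0.5
p = np.where(y[:, None], 0.55, 0.45)           # weakly informative features
X = (rng.random((N, n)) < p).astype(int)

for m in [10, 30, 100, 300, 1000]:
    errs = {"NB": [], "LR": []}
    for _ in range(100):                        # average over random splits
        idx = rng.permutation(N)
        tr, te = idx[:m], idx[m:]
        if y[tr].min() == y[tr].max():          # need both classes in the training set
            continue
        for name, clf in [("NB", BernoulliNB(alpha=1.0)),
                          ("LR", LogisticRegression(max_iter=1000))]:
            clf.fit(X[tr], y[tr])
            errs[name].append((clf.predict(X[te]) != y[te]).mean())
    print(m, np.mean(errs["NB"]), np.mean(errs["LR"]))
```

On data of this sort one typically sees the naive Bayes curve flatten first, with logistic regression overtaking it at larger m, matching the two-regime picture above.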
To test these predictions, we performed experiments on 15 datasets, 8 with continuous inputs and 7 with discrete inputs, from the UCI Machine Learning repository.(2) The results of these experiments are shown in Figure 1. We find that the theoretical predictions are borne out surprisingly well. There are a few cases in which logistic regression's performance did not catch up to that of naive Bayes, but this is observed primarily in particularly small datasets, in which m presumably cannot grow large enough for us to observe the expected dominance of logistic regression in the large-m limit.

(2) To maximize consistency with the theoretical discussion, these experiments avoided discrete/continuous hybrids by considering only the discrete or only the continuous-valued inputs for a dataset where necessary. Train/test splits were random subject to there being at least one example of each class in the training set, and continuous-valued inputs were rescaled to [0,1] if necessary. In the case of linearly separable datasets, logistic regression makes no distinction between the many possible separating planes. In this setting we used an MCMC sampler to pick a classifier randomly from them (i.e., so the errors reported are empirical averages over the separating hyperplanes). Our implementation of Normal Discriminant Analysis also used the (standard) trick of adding ε to the diagonal of the covariance matrix to ensure invertibility, and for naive Bayes we used l = 1.

5 Discussion

Efron [2] also analyzed logistic regression and Normal Discriminant Analysis (for continuous inputs), and concluded that the former was only asymptotically very slightly (1/3 to 1/2 times) less statistically efficient. This is in marked contrast to our results, and one key difference is that, rather than assuming p(x|y) is Gaussian with a diagonal covariance matrix (as we did), Efron considered the case where p(x|y) is modeled as a Gaussian with a full covariance matrix. In this setting, the estimated covariance matrix is singular if we have fewer than linear in n training examples, so it is no surprise that Normal Discriminant Analysis cannot learn much faster than logistic regression here. A second important difference is that Efron considered only the special case in which p(x|y) is truly Gaussian. Such an asymptotic comparison is not very useful in the general case, since the only possible conclusion, if ε(h_Dis,∞) < ε(h_Gen,∞), is that logistic regression is the superior algorithm. In contrast, as we saw previously, it is in the non-asymptotic case that the most interesting "two-regime" behavior is observed.

Practical classification algorithms generally involve some form of regularization; in particular, logistic regression can often be improved upon in practice by techniques such as shrinking the parameters via an L1 constraint, imposing a margin constraint in the separable case, or various forms of averaging. Such regularization techniques can be viewed as changing the model family, however, and as such they are largely orthogonal to the analysis in this paper, which is based on examining particularly clear cases of Generative-Discriminative model pairings. By developing a clearer understanding of the conditions under which pure generative and discriminative approaches are most successful, we should be better able to design hybrid classifiers that enjoy the best properties of either across a wider range of conditions.
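For concreteness, one such shrinkage variant, sketched here (ours, assuming scikit-learn; not part of this paper's experiments) as a drop-in replacement for the plain logistic regression above:

```python
# L1-shrunk logistic regression (a sketch, not part of the paper's experiments);
# penalty="l1" requires a solver that supports it, e.g. liblinear.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
# C is the inverse regularization strength: smaller C shrinks the parameters harder.
```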
Finally, while our discussion has focused on naive Bayes and logistic regression, it is straightforward to extend the analyses to several other models, including generative-discriminative pairs generated by using a fixed-structure, bounded fan-in Bayesian network model for p(x|y) (of which naive Bayes is a special case).

Acknowledgments

We thank Andrew McCallum for helpful conversations. A. Ng is supported by a Microsoft Research fellowship. This work was also supported by a grant from Intel Corporation, NSF grant IIS-9988642, and ONR MURI N00014-00-1-0637.

References

[1] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[2] B. Efron. The efficiency of logistic regression compared to Normal Discriminant Analysis. Journal of the American Statistical Association, 70:892-898, 1975.

[3] P. Goldberg and M. Jerrum. Bounding the VC dimension of concept classes parameterized by real numbers. Machine Learning, 18:131-148, 1995.

[4] G. J. McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992.

[5] Y. D. Rubinstein and T. Hastie. Discriminative vs. informative learning. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 49-53. AAAI Press, 1997.

[6] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
", "award": [], "sourceid": 2020, "authors": [{"given_name": "Andrew", "family_name": "Ng", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}