{"title": "Model Complexity, Goodness of Fit and Diminishing Returns", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 394, "abstract": null, "full_text": "Model  Complexity,  Goodness of Fit  and \n\nDiminishing Returns \n\nIgor V.  Cadez \n\nPadhraic Smyth \n\nInformation and  Computer Science \n\nInformation and Computer Science \n\nUniversity of California \n\nIrvine,  CA  92697-3425, U.S.A. \n\nUniversity of California \n\nIrvine,  CA  92697-3425,  U.S.A. \n\nAbstract \n\nWe  investigate a  general characteristic of the trade-off in  learning \nproblems  between  goodness-of-fit  and  model  complexity.  Specifi(cid:173)\ncally we  characterize a general class of learning problems where the \ngoodness-of-fit  function  can  be  shown  to  be  convex  within  first(cid:173)\norder  as  a  function  of model  complexity.  This  general  property \nof  \"diminishing  returns\"  is  illustrated  on  a  number  of real  data \nsets  and learning problems, including finite  mixture modeling and \nmultivariate linear regression. \n\nIntroduction, Motivation,  and Related Work \n\n1 \nAssume  we  have  a  data set  D  = {Xl, X2, ... , x n },  where  the  X i  could  be  vectors, \nsequences,  etc.  We  consider  modeling  the  data set  D  using  models  indexed  by  a \ncomplexity index k,  1 :::;  k :::;  kmax \u2022  For example, the models could be finite  mixture \nprobability  density  functions  (PDFs)  for  vector  Xi'S  where  model  complexity  is \nindexed by the number of components k  in the mixture.  Alternatively, the modeling \ntask  could  be to  fit  a  conditional  regression  model  y  = g(Zk)  + e,  where  now  y  is \none of the variables in the vector  X  and  Z  is  some subset of size  k  of the remaining \ncomponents in the  X  vector. \n\nSuch learning tasks can typically be characterized by the existence of a  model and \na  loss  function.  A fitted  model of complexity  k  is  a  function  of the data points  D \nand depends on a  specific  set of fitted  parameters B.  The loss  function  (goodness(cid:173)\nof-fit)  is  a  functional  of the  model  and  maps  each specific  model to  a  scalar  used \nto evaluate the model,  e.g.,  likelihood for  density  estimation or sum-of-squares for \nregression. \n\nFigure 1 illustrates a typical empirical curve for loss function versus complexity, for \nmixtures  of  Markov  models  fitted  to  a  large  data set  of 900,000  sequences.  The \ncomplexity k  is the number of Markov models being used in the mixture (see  Cadez \net  al. \n(2000)  for  further  details  on  the  model  and  the  data set).  The  empirical \ncurve  has  a  distinctly  concave  appearance,  with large  relative  gains  in fit  for  low \ncomplexity models and much more modest relative gains for high complexity models. \nA  natural  question  is  whether  this  concavity  characteristic  can  be  viewed  as  a \ngeneral phenomenon in learning and under what assumptions on model classes and \n\n\fNwnber of M Ixture Cmnponen1S  11] \n\nFigure 1:  Log-likelihood scores for  a  Markov mixtures data set. \n\nloss  functions  the  concavity  can  be  shown  to  hold.  The  goal  of this  paper  is  to \nillustrate that in fact  it is  a  natural characteristic for  a  broad range of problems in \nmixture modeling and linear regression. 
We note, of course, that using goodness-of-fit alone for generalization will lead to the selection of the most complex model under consideration and will not in general select the model which generalizes best to new data. Nonetheless, our primary focus of interest in this paper is how goodness-of-fit loss functions (such as likelihood and squared error, defined on the training data D) behave in general as a function of model complexity k. Our concavity results have a number of interesting implications. For example, for model selection methods which add a penalty term to the goodness-of-fit (e.g., BIC), the resulting score function will be unimodal as a function of complexity k within first order.

Li and Barron (1999) have shown that for finite mixture models the expected value of the log-likelihood for any k is bounded below by a function of the form -C/k, where C is a constant which is independent of k. The results presented here are complementary in the sense that we show that the actual maximizing log-likelihood itself is concave to first order as a function of k. Furthermore, we obtain a more general principle of \"diminishing returns,\" including both finite mixtures and subset selection in regression.

2  Notation

We define y = y(x) as a scalar function of x, namely a prediction at x. In linear regression y = y(x) is a linear function of the components of x, while in density estimation y = y(x) is the value of the density function at x. Although the goals of regression and density estimation are quite different, we can view them both as simply techniques for approximating an unknown true function at different values of x. We denote the prediction of a model of complexity k as y_k(x|θ), where the subscript indicates the model complexity and θ is the associated set of fitted parameters. Since different choices of parameters in general yield different models, we will typically abbreviate the notation somewhat and use different letters for different parameterizations of the same functional form (i.e., the same complexity); e.g., we may use y_k(x), g_k(x), h_k(x) to refer to models of complexity k instead of specifying y_k(x|θ_1), y_k(x|θ_2), y_k(x|θ_3), etc. Furthermore, since all models under discussion are functions of x, we sometimes omit the explicit dependence on x and use the compact notation y_k, g_k, h_k.

We focus on classes of models that can be characterized by more complex models having a linear dependence on simpler models within the class. More formally, any model of complexity k can be decomposed as:

    y_k = a_1 g_1 + a_2 h_1 + ... + a_k w_1.    (1)

In PDF mixture modeling we have y_k = p(x) and each model g_1, h_1, ..., w_1 is a basis PDF (e.g., a single Gaussian) but with different parameters. In multivariate linear regression each model g_1, h_1, ..., w_1 represents a regression on a single variable, e.g., g_1(x) above is g_1(x) = γ_p x_p, where x_p is the p-th variable in the set and γ_p is the corresponding coefficient one would obtain if regressing on x_p alone. One of the g_1, h_1, ..., w_1 can be a dummy constant variable to account for the intercept term. Note that the total set of parameters for the model y_k in both cases can be viewed as consisting of both the mixing proportions (the a's) and the parameters of each individual component model.
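
As a concrete illustration of Equation 1 in the mixture case, the sketch below (an illustration under arbitrary assumed parameter values, not code from the paper) evaluates a k-component one-dimensional Gaussian mixture density as a weighted sum of single-component densities, which is exactly the linear dependence on simpler models described above.

    # Sketch: Equation 1 for PDF mixtures -- a k-component density is a weighted
    # sum of one-component densities.  All parameter values are arbitrary illustrations.
    import numpy as np
    from scipy.stats import norm

    weights = np.array([0.5, 0.3, 0.2])     # the mixing proportions a_1, ..., a_k (sum to 1)
    means   = np.array([0.0, 3.0, 7.0])     # parameters of the individual components
    scales  = np.array([1.0, 0.5, 2.0])

    def mixture_pdf(x):
        # y_k(x) = a_1 g_1(x) + a_2 h_1(x) + ... + a_k w_1(x)
        components = np.array([norm.pdf(x, loc=m, scale=s) for m, s in zip(means, scales)])
        return weights @ components

    x = np.linspace(-3.0, 12.0, 5)
    print(mixture_pdf(x))                   # density values at a few points
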
The loss function is a functional on models, and we write it as E(y_k). For simplicity, we use the notation E*_k to denote the value of the loss function for the best k-component model, so that E*_k ≤ E(y_k) for any model y_k.[1] For example, the loss function in PDF mixture modeling is the negative log-likelihood. In linear regression we use the empirical mean squared error (MSE) as the loss function. The loss functions of general interest in this context are those that decompose into a sum of functions over the data points in the data set D (equivalently, an independence assumption in a likelihood framework), i.e.,

    E(y_k) = Σ_{i=1}^{n} f(y_k(x_i)).    (2)

For example, in PDF mixture modeling f(y_k) = -ln y_k, while in regression modeling f(y_k) = (y - y_k)^2, where y is a known target value.

[1] We assume the learning task consists of minimization of the loss function. If maximization is more appropriate, we can just consider minimization of the negative of the loss function.

3  Necessary Conditions on Models and Loss Functions

We consider models that satisfy several conditions that are commonly met in real data analysis applications and are satisfied by both PDF mixture models and linear regression models:

1. As k increases we have a nested model class, i.e., each model of complexity k contains each model of complexity k' < k as a special case (i.e., it reduces to a simpler model for a special choice of the parameters).

2. Any two models of complexities k_1 and k_2 can be combined as a weighted sum in any proportion to yield a valid model of complexity k = k_1 + k_2.

3. Each model of complexity k = k_1 + k_2 can be decomposed into a weighted sum of two valid models of complexities k_1 and k_2 respectively, for each valid choice of k_1 and k_2.

The first condition guarantees that the loss function is a non-increasing function of k for optimal models of complexity k (in the sense of minimizing the loss function E), the second condition prevents artificial correlation between the component models, while the third condition guarantees that all components are of equal expressive power. As an example, the standard Gaussian mixture model satisfies all three properties whether the covariance matrices are unconstrained or individually constrained. As a counter-example, a Gaussian mixture model where the covariance matrices are constrained to be equal across all components does not satisfy the second property.
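
For concreteness, the following sketch (an illustration under assumed parameter values, not from the paper) shows condition 2 for one-dimensional Gaussian mixtures: two valid mixtures of sizes k_1 and k_2 are combined in proportion ε into a valid mixture of size k_1 + k_2, and the combined density is, by construction, the ε-weighted sum of the two original densities (reading the construction in reverse gives condition 3).

    # Sketch: combining two Gaussian mixtures (condition 2) into a larger valid mixture.
    # All parameter values are arbitrary, for illustration only.
    import numpy as np
    from scipy.stats import norm

    def pdf(mix, x):
        # mix = (weights, means, scales), with weights summing to 1.
        w, m, s = mix
        return sum(wi * norm.pdf(x, loc=mi, scale=si) for wi, mi, si in zip(w, m, s))

    mix_a = (np.array([0.6, 0.4]), np.array([0.0, 2.0]), np.array([1.0, 1.0]))   # k_1 = 2
    mix_b = (np.array([1.0]),      np.array([5.0]),      np.array([2.0]))        # k_2 = 1

    eps = 0.3
    combined = (np.concatenate([(1 - eps) * mix_a[0], eps * mix_b[0]]),   # new weights still sum to 1
                np.concatenate([mix_a[1], mix_b[1]]),
                np.concatenate([mix_a[2], mix_b[2]]))

    x = np.linspace(-3.0, 9.0, 7)
    # The (k_1 + k_2)-component mixture equals the eps-weighted combination of the two densities.
    assert np.allclose(pdf(combined, x), (1 - eps) * pdf(mix_a, x) + eps * pdf(mix_b, x))
    print('combined weights:', combined[0], ' sum =', combined[0].sum())
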
4  Theoretical Results on Loss Function Convexity

We formulate and prove the following theorem:

Theorem 1: In a learning problem that satisfies the properties from Section 3, the loss function is first-order convex in model complexity k, meaning that E*_{k+1} - 2E*_k + E*_{k-1} ≥ 0 within first order (as defined in the proof). The quantities E*_k and E*_{k±1} are the values of the loss function for the best k and k±1-component models.

Proof: In the first part of the proof we analyze a general difference of loss functions and write it in a convenient form. Consider two arbitrary models, g and h, and the corresponding loss functions E(g) and E(h) (g and h need not have the same complexity). The difference in loss functions can be expressed as:

    E(g) - E(h) = Σ_{i=1}^{n} { f[g(x_i)] - f[h(x_i)] }
                = Σ_{i=1}^{n} { f[h(x_i)(1 + δ_{g,h}(x_i))] - f[h(x_i)] }
                = α Σ_{i=1}^{n} h(x_i) f'(h(x_i)) δ_{g,h}(x_i),    (3)

where the last equality comes from a first-order Taylor series expansion around each δ_{g,h}(x_i) = 0, α is an unknown constant of proportionality (to make the equation exact), and

    δ_{g,h}(x) = [g(x) - h(x)] / h(x)    (4)

represents the relative difference of the models g and h at the point x. For example, Equation 3 reduces to a first-order Taylor series approximation for α = 1. If f(y) is a convex function we also have

    E(g) - E(h) ≥ Σ_{i=1}^{n} h(x_i) f'(h(x_i)) δ_{g,h}(x_i),    (5)

since the remainder in the Taylor series expansion, R_2 = (1/2) f''(h(1 + θδ)) (hδ)^2 for some θ between 0 and 1, is non-negative.

In the second part of the proof we use Equation 5 to derive an appropriate condition on loss functions. Consider the best k and k±1-component models and the corresponding difference of loss functions E*_{k+1} - 2E*_k + E*_{k-1}, which we can write using the notation from Equation 3 and Equation 5 (since we consider the convex functions f(y) = -ln y for PDF modeling and f(y) = (y - y_i)^2 for best subset regression) as:

    E*_{k+1} - 2E*_k + E*_{k-1}
        = Σ_{i=1}^{n} { f[y*_{k+1}(x_i)] - f[y*_k(x_i)] } + Σ_{i=1}^{n} { f[y*_{k-1}(x_i)] - f[y*_k(x_i)] }
        ≥ Σ_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) δ_{y*_{k+1},y*_k}(x_i) + Σ_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) δ_{y*_{k-1},y*_k}(x_i)
        = Σ_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) [ δ_{y*_{k+1},y*_k}(x_i) + δ_{y*_{k-1},y*_k}(x_i) ].    (6)

According to the requirements on models in Section 3, the best (k+1)-component model can be decomposed as

    y*_{k+1} = (1 - ε) g_k + ε g_1,

where g_k is a k-component model and g_1 is a 1-component model. Similarly, an artificial model e_k can be constructed from the best (k-1)-component model:

    e_k = (1 - ε) y*_{k-1} + ε g_1.

Upon subtracting y*_k from each of these equations and dividing by y*_k, using the notation from Equation 4, we get

    δ_{y*_{k+1},y*_k} = (1 - ε) δ_{g_k,y*_k} + ε δ_{g_1,y*_k},
    δ_{e_k,y*_k}      = (1 - ε) δ_{y*_{k-1},y*_k} + ε δ_{g_1,y*_k},

which upon subtraction and rearrangement of terms yields

    δ_{y*_{k+1},y*_k} + δ_{y*_{k-1},y*_k} = (1 - ε) δ_{g_k,y*_k} + δ_{e_k,y*_k} + ε δ_{y*_{k-1},y*_k}.    (7)

If we evaluate this equation at each of the data points x_i and substitute the result back into Equation 6 we get:

    E*_{k+1} - 2E*_k + E*_{k-1} ≥ Σ_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) [ (1 - ε) δ_{g_k,y*_k}(x_i) + δ_{e_k,y*_k}(x_i) + ε δ_{y*_{k-1},y*_k}(x_i) ].    (8)

In the third part of the proof we analyze each of the terms in Equation 8 using Equation 3. Consider the first term,

    Δ_{g_k,y*_k} = Σ_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) δ_{g_k,y*_k}(x_i),    (9)

which depends on the relative difference of the models g_k and y*_k at each of the data points x_i. According to Equation 3, for small δ_{g_k,y*_k}(x_i) (which is presumably true), we can set α ≈ 1 to get a first-order Taylor expansion. Since y*_k is the best k-component model, we have E(g_k) ≥ E(y*_k) = E*_k and consequently

    E(g_k) - E(y*_k) = α Δ_{g_k,y*_k} ≈ Δ_{g_k,y*_k} ≥ 0.    (10)

Note that in order for the last inequality to hold, we do not require that α ≈ 1, but only that

    α ≥ 0,    (11)

which is a weaker condition that we refer to as the first-order approximation. In other words, we only require that the sign is preserved when making the Taylor expansion, while the actual value need not be very accurate. Similarly, each of the three terms on the right-hand side of Equation 8 is first-order positive, since E(y*_k) ≤ E(g_k), E(e_k), E(y*_{k-1}). This shows that

    E*_{k+1} - 2E*_k + E*_{k-1} ≥ 0

within first order, concluding the proof.
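
The first step of the proof (Equations 3-5) is easy to check numerically. The sketch below is purely illustrative (the two density models and the data are assumptions, not from the paper): it compares the exact loss difference E(g) - E(h) for f(y) = -ln y with the first-order term Σ_i h(x_i) f'(h(x_i)) δ_{g,h}(x_i); since f is convex, the exact difference should dominate the first-order term, as in Equation 5.

    # Sketch: numerical check of the first-order expansion (Eqs. 3-5) with f(y) = -ln y.
    # The models g and h and the data points are arbitrary choices for illustration.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x = rng.normal(loc=0.0, scale=1.5, size=200)        # data points x_i

    h = norm.pdf(x, loc=0.0, scale=1.5)                 # density model h evaluated at the x_i
    g = norm.pdf(x, loc=0.2, scale=1.4)                 # a slightly different density model g

    delta = (g - h) / h                                 # Eq. 4: relative difference of g and h
    exact = np.sum(-np.log(g) + np.log(h))              # E(g) - E(h) for f(y) = -ln y
    first_order = np.sum(h * (-1.0 / h) * delta)        # sum_i h(x_i) f'(h(x_i)) delta(x_i)

    print(f'exact difference : {exact:.4f}')
    print(f'first-order term : {first_order:.4f}')
    print('Eq. 5 satisfied (exact >= first-order)?', exact >= first_order)
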
5  Convexity in Common Learning Problems

In this section we specialize Theorem 1 to several well-known learning situations. Each proof consists of merely selecting the appropriate loss function E(y) and model family y.

5.1  Concavity of Mixture Model Log-Likelihoods

Theorem 2: In mixture model learning, using log-likelihood as the loss function and using unconstrained mixture components, the in-sample log-likelihood is a first-order concave function of the complexity k.

Proof: By using f(y) = -ln y in Theorem 1, the loss function E(y) becomes the negative of the in-sample log-likelihood; hence it is a first-order convex function of complexity k, i.e., the log-likelihood is first-order concave.

Corollary 1: If a linear or convex penalty term in k is subtracted from the in-sample log-likelihood in Theorem 2, using the mixture models as defined in Theorem 2, then the penalized likelihood can have at most one maximum to within first order. The BIC criterion, for example, satisfies this condition.

5.2  Convexity of Mean Squared Error for Subset Selection in Linear Regression

Theorem 3: In linear regression learning, where y_k represents the best linear regression defined over all possible subsets of k regression variables, the mean squared error (MSE) is a first-order convex function of the complexity k.

Proof: We use f(y_k(x_i)) = (y_i - y_k(x_i))^2, which is a convex function of y_k. The corresponding loss function E(y_k) becomes the mean squared error and is a first-order convex function of the complexity k by the proof of Theorem 1.

Corollary 2: If a concave or linear penalty term in k is added to the mean squared error as defined in Theorem 3, then the resulting penalized mean squared error can have at most one minimum to within first order. Such penalty terms include Mallows' Cp criterion, AIC, BIC, predicted squared error, etc. (e.g., see Bishop (1995)).
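
As a small empirical illustration of Theorem 3 (not taken from the paper; the data are synthetic), the sketch below computes, for each subset size k, the minimum in-sample MSE over all subsets of k predictors by brute-force enumeration, and prints the second differences of the resulting curve, which should be approximately non-negative under first-order convexity.

    # Sketch: best-subset linear regression MSE as a function of subset size k.
    # Synthetic data; brute-force search over all subsets, for illustration only.
    import itertools
    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 200, 8
    X = rng.normal(size=(n, p))
    beta = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
    y = X @ beta + rng.normal(scale=1.0, size=n)

    def mse_for_subset(cols):
        A = np.column_stack([np.ones(n), X[:, list(cols)]])    # includes an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.mean((y - A @ coef) ** 2)

    best_mse = np.array([min(mse_for_subset(c) for c in itertools.combinations(range(p), k))
                         for k in range(1, p + 1)])

    print('best-subset MSE by k:', np.round(best_mse, 3))
    print('second differences (approximately >= 0 under first-order convexity):')
    print(np.round(best_mse[2:] - 2 * best_mse[1:-1] + best_mse[:-2], 4))
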
6  Experimental Results

In this section we demonstrate empirical evidence of the approximate concavity property on three different data sets, with model families and loss functions which satisfy the assumptions stated earlier:

1. Mixtures of Gaussians: 3962 data points in 2 dimensions, representing the first two principal components of historical geopotential data from upper-atmosphere data records, were fit with a mixture of k Gaussian components, k varying from 1 to 20 (see Smyth, Ide, and Ghil (1999) for more discussion of this data). Figure 2(a) illustrates that the log-likelihood is approximately concave as a function of k. Note that it is not completely concave. This could be a result of either local maxima in the fitting process (the maximum likelihood solutions in the interior of parameter space were selected as the best obtained by EM from 10 different randomly chosen initial conditions), or may indicate that concavity cannot be proven beyond a first-order characterization in the general case.

2. Mixtures of Markov Chains: Page-request sequences logged at the msnbc.com Web site over a 24-hour period from over 900,000 individuals were fit with mixtures of first-order Markov chains (see Cadez et al. (2000) for further details). Figure 1 again clearly shows a concave characteristic for the log-likelihood as a function of k, the number of Markov components in the model.

3. Subset Selection in Linear Regression: Autoregressive (AR) linear models were fit (closed-form solutions for the optimal model parameters) to a monthly financial time series with 307 observations, for all possible combinations of lags (all possible subsets) from order k = 1 to order k = 12. For example, the k = 1 model represents the best model with a single predictor from the previous 12 months, not necessarily the AR(1) model. Again the goodness-of-fit curve is almost convex in k (Figure 2(b)), except at k = 9 where there is a slight non-convexity: this could again be either a numerical estimation effect or a fundamental characteristic indicating that convexity is only true to first order.

Figure 2: (a) In-sample log-likelihood for mixture modeling of the atmospheric data set (x-axis: number of mixture components k); (b) mean-squared error for regression using the financial data set (x-axis: number of regression variables k).

7  Discussion and Conclusions

Space does not permit a full discussion of the various implications of the results derived here. The main implication is that for at least two common learning scenarios the maximizing/minimizing value of the loss function is strongly constrained as model complexity is varied. Thus, for example, when performing model selection using penalized goodness-of-fit (as in the corollaries above), variants of binary search may be quite useful in problems where k is very large (in the mixtures of Markov chains above it is not necessary to fit the model for all values of k, i.e., we can simply interpolate within first order). Extensions to model selection using loss functions defined on out-of-sample test data sets can also be derived, and can be carried over under appropriate assumptions to cross-validation. Note that the results described here do not have an obvious extension to non-linear models (such as feed-forward neural networks) or to loss functions such as the 0/1 loss for classification.
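
To make the search remark concrete, here is a minimal sketch of model selection over k by ternary search on an (assumed) unimodal penalized score. Here score_fn is a hypothetical user-supplied callable, e.g., one that fits the best k-component model and returns its penalized goodness-of-fit such as BIC (lower is better); the search evaluates it for only O(log k_max) values of k rather than for every k.

    # Sketch: exploiting first-order unimodality of a penalized score to avoid
    # fitting models for every k.  `score_fn` is a hypothetical user-supplied
    # function, e.g. k -> BIC of the best k-component mixture (lower is better).
    def select_k(score_fn, k_min, k_max):
        # Ternary search over integers, assuming score_fn is unimodal in k.
        lo, hi = k_min, k_max
        while hi - lo > 2:
            m1 = lo + (hi - lo) // 3
            m2 = hi - (hi - lo) // 3
            if score_fn(m1) < score_fn(m2):
                hi = m2      # the minimum cannot lie to the right of m2
            else:
                lo = m1      # the minimum cannot lie to the left of m1
        return min(range(lo, hi + 1), key=score_fn)

    # Illustrative usage with a toy unimodal score (minimum at k = 17):
    print(select_k(lambda k: (k - 17) ** 2, 1, 100))

In practice score_fn would be expensive, since it refits models, so one would cache its results (e.g., with functools.lru_cache) because the search can query the same k more than once.
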
References

Bishop, C., Neural Networks for Pattern Recognition, Oxford University Press, 1995, pp. 376-377.

Cadez, I., D. Heckerman, C. Meek, P. Smyth, and S. White, 'Visualization of navigation patterns on a Web site using model-based clustering,' Technical Report MS-TR-00-18, Microsoft Research, Redmond, WA, 2000.

Li, Jonathan Q., and Barron, Andrew R., 'Mixture density estimation,' presented at NIPS 99.

Smyth, P., K. Ide, and M. Ghil, 'Multiple regimes in Northern Hemisphere height fields via mixture model clustering,' Journal of the Atmospheric Sciences, vol. 56, no. 21, pp. 3704-3723, 1999.
", "award": [], "sourceid": 1865, "authors": [{"given_name": "Igor", "family_name": "Cadez", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}