{"title": "Model Complexity, Goodness of Fit and Diminishing Returns", "book": "Advances in Neural Information Processing Systems", "page_first": 388, "page_last": 394, "abstract": null, "full_text": "Model Complexity, Goodness of Fit and \n\nDiminishing Returns \n\nIgor V. Cadez \n\nPadhraic Smyth \n\nInformation and Computer Science \n\nInformation and Computer Science \n\nUniversity of California \n\nIrvine, CA 92697-3425, U.S.A. \n\nUniversity of California \n\nIrvine, CA 92697-3425, U.S.A. \n\nAbstract \n\nWe investigate a general characteristic of the trade-off in learning \nproblems between goodness-of-fit and model complexity. Specifi(cid:173)\ncally we characterize a general class of learning problems where the \ngoodness-of-fit function can be shown to be convex within first(cid:173)\norder as a function of model complexity. This general property \nof \"diminishing returns\" is illustrated on a number of real data \nsets and learning problems, including finite mixture modeling and \nmultivariate linear regression. \n\nIntroduction, Motivation, and Related Work \n\n1 \nAssume we have a data set D = {Xl, X2, ... , x n }, where the X i could be vectors, \nsequences, etc. We consider modeling the data set D using models indexed by a \ncomplexity index k, 1 :::; k :::; kmax \u2022 For example, the models could be finite mixture \nprobability density functions (PDFs) for vector Xi'S where model complexity is \nindexed by the number of components k in the mixture. Alternatively, the modeling \ntask could be to fit a conditional regression model y = g(Zk) + e, where now y is \none of the variables in the vector X and Z is some subset of size k of the remaining \ncomponents in the X vector. \n\nSuch learning tasks can typically be characterized by the existence of a model and \na loss function. A fitted model of complexity k is a function of the data points D \nand depends on a specific set of fitted parameters B. 
The loss function (goodness-of-fit) is a functional of the model and maps each specific model to a scalar used to evaluate the model, e.g., likelihood for density estimation or sum-of-squares for regression. \n\nFigure 1 illustrates a typical empirical curve of loss function versus complexity, for mixtures of Markov models fitted to a large data set of 900,000 sequences. The complexity k is the number of Markov models being used in the mixture (see Cadez et al. (2000) for further details on the model and the data set). The empirical curve has a distinctly concave appearance, with large relative gains in fit for low-complexity models and much more modest relative gains for high-complexity models. A natural question is whether this concavity characteristic can be viewed as a general phenomenon in learning, and under what assumptions on model classes and loss functions the concavity can be shown to hold. \n\n\fFigure 1: Log-likelihood scores for the Markov mixtures data set, as a function of the number of mixture components k. \n\nThe goal of this paper is to illustrate that it is in fact a natural characteristic for a broad range of problems in mixture modeling and linear regression. \n\nWe note of course that using goodness-of-fit alone will lead to the selection of the most complex model under consideration and will not in general select the model which generalizes best to new data. Nonetheless, our primary focus of interest in this paper is how goodness-of-fit loss functions (such as likelihood and squared error, defined on the training data D) behave in general as a function of model complexity k. Our concavity results have a number of interesting implications. For example, for model selection methods which add a penalty term to the goodness-of-fit (e.g., BIC), the resulting score function will be unimodal as a function of complexity k within first order. 
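This unimodality claim can be illustrated with a small numerical sketch (our own construction in Python, using a synthetic concave score of the form -C/k rather than a fitted likelihood; the constants are arbitrary). Subtracting a linear penalty from a concave sequence leaves its second differences unchanged, so the penalized score is again concave and has a single maximum:

```python
# Illustrative sketch with synthetic numbers (not the paper's data):
# a concave in-sample score minus a linear penalty (as in BIC-style
# criteria) has the same second differences, hence stays concave and
# is therefore unimodal in k.

def second_differences(values):
    """v[k+1] - 2*v[k] + v[k-1] for each interior index k."""
    return [values[k + 1] - 2 * values[k] + values[k - 1]
            for k in range(1, len(values) - 1)]

C = 100.0                                        # hypothetical constant
loglik = [-C / k for k in range(1, 21)]          # concave, increasing in k
penalized = [ll - 2.5 * (i + 1) for i, ll in enumerate(loglik)]

assert all(d <= 0 for d in second_differences(loglik))     # concave score
assert all(d <= 0 for d in second_differences(penalized))  # still concave

# A concave sequence has at most one (interior) maximum:
best_k = max(range(len(penalized)), key=penalized.__getitem__) + 1
```
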
\n\nLi and Barron (1999) have shown that for finite mixture models the expected value of the log-likelihood for any k is bounded below by a function of the form -C/k, where C is a constant independent of k. The results presented here are complementary in the sense that we show that the maximized log-likelihood itself is concave to first order as a function of k. Furthermore, we obtain a more general principle of \"diminishing returns,\" covering both finite mixtures and subset selection in regression. \n\n2 Notation \n\nWe define y = y(x) as a scalar function of x, namely a prediction at x. In linear regression y = y(x) is a linear function of the components of x, while in density estimation y = y(x) is the value of the density function at x. Although the goals of regression and density estimation are quite different, we can view them both as techniques for approximating an unknown true function at different values of x. We denote the prediction of a model of complexity k by y_k(x|θ), where the subscript indicates the model complexity and θ is the associated set of fitted parameters. Since different choices of parameters in general yield different models, we will typically abbreviate the notation somewhat and use different letters for different parameterizations of the same functional form (i.e., the same complexity): e.g., we may use y_k(x), g_k(x), h_k(x) to refer to models of complexity k instead of writing y_k(x|θ_1), y_k(x|θ_2), y_k(x|θ_3), etc. Furthermore, since all models under discussion are functions of x, we sometimes omit the explicit dependence on x and use the compact notation y_k, g_k, h_k. \n\nWe focus on classes of models that can be characterized by more complex models having a linear dependence on simpler models within the class. More formally, any model of complexity k can be decomposed as: \n\ny_k = a_1 g_1 + a_2 h_1 + ... 
+ a_k w_1.    (1) \n\nIn PDF mixture modeling we have y_k = p(x) and each model g_1, h_1, ..., w_1 is a basis PDF (e.g., a single Gaussian) but with different parameters. In multivariate linear regression each model g_1, h_1, ..., w_1 represents a regression on a single variable; e.g., g_1(x) above is g_1(x) = γ_p x_p, where x_p is the p-th variable in the set and γ_p is the corresponding coefficient one would obtain if regressing on x_p alone. One of g_1, h_1, ..., w_1 can be a dummy constant variable to account for the intercept term. Note that the total set of parameters for the model y_k in both cases can be viewed as consisting of both the mixing proportions (the a's) and the parameters for each individual component model. \n\nThe loss function is a functional on models and we write it as E(y_k). For simplicity, we use the notation E*_k to denote the value of the loss function for the best k-component model. This way, E*_k <= E(y_k) for any model y_k.^1 For example, the loss function in PDF mixture modeling is the negative log-likelihood. In linear regression we use empirical mean squared error (MSE) as the loss function. The loss functions of general interest in this context are those that decompose into a sum of functions over the data points in the data set D (equivalently, an independence assumption in a likelihood framework), i.e., \n\nE(y_k) = sum_{i=1}^{n} f(y_k(x_i)).    (2) \n\nFor example, in PDF mixture modeling f(y_k) = -ln y_k, while in regression modeling f(y_k) = (y - y_k)^2, where y is a known target value. \n\n3 Necessary Conditions on Models and Loss Functions \n\nWe consider models that satisfy several conditions that are commonly met in real data analysis applications and are satisfied by both PDF mixture models and linear regression models: \n\n1. 
As k increases we have a nested model class, i.e., each model of complexity k contains each model of complexity k' < k as a special case (i.e., it reduces to a simpler model for a special choice of the parameters). \n\n2. Any two models of complexities k_1 and k_2 can be combined as a weighted sum in any proportion to yield a valid model of complexity k = k_1 + k_2. \n\n3. Each model of complexity k = k_1 + k_2 can be decomposed into a weighted sum of two valid models of complexities k_1 and k_2 respectively, for each valid choice of k_1 and k_2. \n\nThe first condition guarantees that the loss function is a non-increasing function of k for optimal models of complexity k (in the sense of minimizing the loss function E), the second condition prevents artificial correlation between the component models, while the third condition guarantees that all components are of equal expressive power. As an example, the standard Gaussian mixture model satisfies all three properties whether the covariance matrices are unconstrained or individually constrained. As a counter-example, a Gaussian mixture model where the covariance matrices are constrained to be equal across all components does not satisfy the second property. \n\n^1 We assume the learning task consists of minimization of the loss function. If maximization is more appropriate, we can simply consider minimization of the negative of the loss function. \n\n\f4 Theoretical Results on Loss Function Convexity \n\nWe formulate and prove the following theorem: \n\nTheorem 1: In a learning problem that satisfies the properties from Section 3, the loss function is first-order convex in model complexity k, meaning that E*_{k+1} - 2E*_k + E*_{k-1} >= 0 within first order (as defined in the proof). The quantities E*_k and E*_{k±1} are the values of the loss function for the best k- and (k ± 1)-component models. 
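As a concrete check of the second condition for Gaussian mixtures, the following sketch (our own illustration; the component means, variances, and weights are arbitrary) verifies numerically that a weighted combination of a 2-component and a 1-component mixture is itself a valid 3-component mixture, with the component weights rescaled by the combination proportions:

```python
import math

# Illustrative sketch (our own example): Gaussian mixtures satisfy
# condition 2 above, since (1-w)*p1 + w*p2 for a k1- and a k2-component
# mixture is itself a valid (k1+k2)-component mixture.

def gauss(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture(x, params):
    """params: list of (weight, mu, sigma) triples; weights sum to 1."""
    return sum(a * gauss(x, mu, s) for a, mu, s in params)

p1 = [(0.4, -1.0, 0.5), (0.6, 2.0, 1.0)]   # k1 = 2 components
p2 = [(1.0, 0.0, 2.0)]                      # k2 = 1 component
w = 0.3

# The combined model of complexity k1 + k2 = 3, with rescaled weights:
combined = ([(a * (1 - w), mu, s) for a, mu, s in p1] +
            [(a * w, mu, s) for a, mu, s in p2])

for x in (-2.0, 0.0, 1.5):
    lhs = (1 - w) * mixture(x, p1) + w * mixture(x, p2)
    rhs = mixture(x, combined)
    assert abs(lhs - rhs) < 1e-12           # same density everywhere tested
```
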
\nProof: In the first part of the proof we analyze a general difference of loss functions and write it in a convenient form. Consider two arbitrary models, g and h, and the corresponding loss functions E(g) and E(h) (g and h need not have the same complexity). The difference in loss functions can be expressed as: \n\nE(g) - E(h) = sum_{i=1}^{n} { f[g(x_i)] - f[h(x_i)] } \n            = sum_{i=1}^{n} { f[h(x_i)(1 + δ_{g,h}(x_i))] - f[h(x_i)] } \n            = α sum_{i=1}^{n} h(x_i) f'(h(x_i)) δ_{g,h}(x_i),    (3) \n\nwhere the last equation comes from a first-order Taylor series expansion around each δ_{g,h}(x_i) = 0, α is an unknown constant of proportionality (to make the equation exact), and \n\nδ_{g,h}(x) ≡ (g(x) - h(x)) / h(x)    (4) \n\nrepresents the relative difference of the models g and h at the point x. For example, Equation 3 reduces to a first-order Taylor series approximation for α = 1. If f(y) is a convex function we also have: \n\nE(g) - E(h) >= sum_{i=1}^{n} h(x_i) f'(h(x_i)) δ_{g,h}(x_i),    (5) \n\nsince the remainder in the Taylor series expansion, R_2 = (1/2) f''(h(1 + ξδ)) (hδ)^2 for some ξ in (0, 1), is >= 0. \n\nIn the second part of the proof we use Equation 5 to derive an appropriate condition on loss functions. Consider the best k- and (k ± 1)-component models and the corresponding difference of loss functions E*_{k+1} - 2E*_k + E*_{k-1} = (E*_{k+1} - E*_k) + (E*_{k-1} - E*_k), which we can write using the notation from Equation 3 and Equation 5 (since we consider convex functions f(y) = -ln y for PDF modeling and f(y) = (y - y_i)^2 for best-subset regression) as: \n\nE*_{k+1} - 2E*_k + E*_{k-1} >= sum_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) δ_{y*_{k+1},y*_k}(x_i) + sum_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) δ_{y*_{k-1},y*_k}(x_i) \n= sum_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) [δ_{y*_{k+1},y*_k}(x_i) + δ_{y*_{k-1},y*_k}(x_i)].    (6) \n\n\fAccording to the requirements on models in Section 3, the best (k + 1)-component model can be decomposed as \n\ny*_{k+1} = (1 - ε) g_k + ε g_1, \n\nwhere g_k is a k-component model and g_1 is a 1-component model. Similarly, an artificial model can be constructed from the best (k - 1)-component model: \n\ne_k = (1 - ε) y*_{k-1} + ε g_1. \n\nUpon subtracting y*_k from each of these equations and dividing by y*_k, using the notation from Equation 4, we get: \n\nδ_{y*_{k+1},y*_k} = (1 - ε) δ_{g_k,y*_k} + ε δ_{g_1,y*_k}, \nδ_{e_k,y*_k} = (1 - ε) δ_{y*_{k-1},y*_k} + ε δ_{g_1,y*_k}, \n\nwhich upon subtraction and rearrangement of terms yields: \n\nδ_{y*_{k+1},y*_k} + δ_{y*_{k-1},y*_k} = (1 - ε) δ_{g_k,y*_k} + δ_{e_k,y*_k} + ε δ_{y*_{k-1},y*_k}.    (7) \n\nIf we evaluate this equation at each of the data points x_i and substitute the result back into Equation 6 we get: \n\nE*_{k+1} - 2E*_k + E*_{k-1} >= sum_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) [(1 - ε) δ_{g_k,y*_k}(x_i) + δ_{e_k,y*_k}(x_i) + ε δ_{y*_{k-1},y*_k}(x_i)].    (8) \n\nIn the third part of the proof we analyze each of the terms in Equation 8 using Equation 3. Consider the first term, \n\nΔ_{g_k,y*_k} = sum_{i=1}^{n} y*_k(x_i) f'(y*_k(x_i)) δ_{g_k,y*_k}(x_i),    (9) \n\nwhich depends on the relative difference of the models g_k and y*_k at each of the data points x_i. According to Equation 3, for small δ_{g_k,y*_k}(x_i) (which is presumably true), we can set α ≈ 1 to get a first-order Taylor expansion. Since y*_k is the best k-component model, we have E(g_k) >= E(y*_k) = E*_k and consequently \n\nE(g_k) - E(y*_k) = α Δ_{g_k,y*_k} ≈ Δ_{g_k,y*_k} >= 0.    (10) \n\nNote that in order for the last inequality to hold, we do not require that α ≈ 1, but only that \n\nα >= 0,    (11) \n\nwhich is a weaker condition that we refer to as the first-order approximation. In other words, we only require that the sign is preserved when making the Taylor expansion, while the actual value need not be very accurate. 
Similarly, each of the three terms on the right-hand side of Equation 8 is first-order positive, since E(y*_k) <= E(g_k), E(e_k), E(y*_{k-1}). This shows that \n\nE*_{k+1} - 2E*_k + E*_{k-1} >= 0 \n\nwithin first order, concluding the proof. \n\n5 Convexity in Common Learning Problems \n\nIn this section we specialize Theorem 1 to several well-known learning situations. Each proof consists of merely selecting the appropriate loss function E(y) and model family y. \n\n\f5.1 Concavity of Mixture Model Log-Likelihoods \n\nTheorem 2: In mixture model learning, using log-likelihood as the loss function and using unconstrained mixture components, the in-sample log-likelihood is a first-order concave function of the complexity k. \n\nProof: By using f(y) = -ln y in Theorem 1, the loss function E(y) becomes the negative of the in-sample log-likelihood; hence it is a first-order convex function of complexity k, i.e., the log-likelihood is first-order concave. \n\nCorollary 1: If a linear or convex penalty term in k is subtracted from the in-sample log-likelihood in Theorem 2, using the mixture models as defined in Theorem 2, then the penalized likelihood can have at most one maximum to within first order. The BIC criterion, for example, satisfies this condition. \n\n5.2 Convexity of Mean-Squared Error for Subset Selection in Linear Regression \n\nTheorem 3: In linear regression learning where y_k represents the best linear regression defined over all possible subsets of k regression variables, the mean squared error (MSE) is first-order convex as a function of the complexity k. \n\nProof: We use f(y_k(x_i)) = (y_i - y_k(x_i))^2, which is a convex function of y_k. The corresponding loss function E(y_k) becomes the mean squared error and is first-order convex as a function of the complexity k by the proof of Theorem 1. 
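Theorem 3 can be probed numerically. The sketch below (our own construction on synthetic data, not the financial series of Section 6) computes the best-subset MSE E*_k by exhaustive search for each subset size k; the resulting sequence is necessarily non-increasing, and its second differences can be inspected for the first-order convexity the theorem predicts:

```python
import itertools
import random

# Illustrative sketch on synthetic data (not from the paper): exhaustive
# best-subset linear regression, recording the best in-sample MSE E*_k
# for each subset size k.

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[c][c] != 0:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b_ for a, b_ in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def mse_for_subset(X, y, cols):
    """In-sample MSE of OLS on the given column subset (plus intercept)."""
    Z = [[1.0] + [row[j] for j in cols] for row in X]
    d = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) for j in range(d)] for i in range(d)]
    b = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(d)]
    beta = solve(A, b)                      # normal equations
    return sum((yi - sum(c * zi for c, zi in zip(beta, z))) ** 2
               for z, yi in zip(Z, y)) / len(y)

random.seed(0)
n, p = 200, 6
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2 * r[0] - 1.5 * r[2] + 0.5 * r[4] + random.gauss(0, 1) for r in X]

best = [min(mse_for_subset(X, y, s)
            for s in itertools.combinations(range(p), k))
        for k in range(1, p + 1)]

assert all(a >= b - 1e-12 for a, b in zip(best, best[1:]))  # non-increasing
# Second differences E*_{k+1} - 2E*_k + E*_{k-1}; Theorem 3 predicts these
# are nonnegative to first order (small violations are possible in finite
# samples, as Section 6 also observes).
sd = [best[k + 1] - 2 * best[k] + best[k - 1] for k in range(1, len(best) - 1)]
```
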
\n\nCorollary 2: If a concave or linear penalty term in k is added to the mean squared error as defined in Theorem 3, then the resulting penalized mean squared error can have at most one minimum to within first order. Such penalty terms include Mallows' C_p criterion, AIC, BIC, predicted squared error, etc. (e.g., see Bishop (1995)). \n\n6 Experimental Results \n\nIn this section we present empirical evidence of the approximate concavity property on three different data sets, with model families and loss functions which satisfy the assumptions stated earlier: \n\n1. Mixtures of Gaussians: 3962 data points in 2 dimensions, representing the first two principal components of historical geopotential data from upper-atmosphere data records, were fit with a mixture of k Gaussian components, with k varying from 1 to 20 (see Smyth, Ide, and Ghil (1999) for more discussion of this data). Figure 2(a) illustrates that the log-likelihood is approximately concave as a function of k. Note that it is not completely concave. This could be a result of local maxima in the fitting process (the maximum-likelihood solutions in the interior of parameter space were selected as the best obtained by EM from 10 different randomly chosen initial conditions), or may indicate that concavity cannot be proven beyond a first-order characterization in the general case. \n\n2. Mixtures of Markov Chains: Page-request sequences logged at the msnbc.com Web site over a 24-hour period from over 900,000 individuals were fit with mixtures of first-order Markov chains (see Cadez et al. (2000) for further details). Figure 1 again clearly shows a concave characteristic for the log-likelihood as a function of k, the number of Markov components in the model. \n\n3. 
Subset Selection in Linear Regression: Autoregressive (AR) linear models were fit (with closed-form solutions for the optimal model parameters) to a monthly financial time series with 307 observations, for all possible combinations of lags (all possible subsets) from order k = 1 to order k = 12. For example, the k = 1 model represents the best model with a single predictor from the previous 12 months, not necessarily the AR(1) model. Again the goodness-of-fit curve is almost convex in k (Figure 2(b)), except at k = 9 where there is a slight non-convexity: this could again be either a numerical estimation effect or a fundamental characteristic indicating that convexity is only true to first order. \n\n\fFigure 2: (a) In-sample log-likelihood for mixture modeling of the atmospheric data set (x-axis: number of mixture components k); (b) mean-squared error for regression using the financial data set (x-axis: number of regression variables k). \n\n7 Discussion and Conclusions \n\nSpace does not permit a full discussion of the various implications of the results derived here. The main implication is that for at least two common learning scenarios the maximizing/minimizing value of the loss function is strongly constrained as model complexity is varied. Thus, for example, when performing model selection using penalized goodness-of-fit (as in the corollaries above), variants of binary search may be quite useful in problems where k is very large (for the mixtures of Markov chains above it is not necessary to fit the model for all values of k, i.e., we can simply interpolate within first order). Extensions to model selection using loss functions defined on out-of-sample test data sets can also be derived, and can be carried over under appropriate assumptions to cross-validation. 
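The binary-search idea can be sketched as follows (our own illustration; the score 100/k + 2.5k is a hypothetical stand-in for a fitted negative log-likelihood plus a linear penalty, not a quantity from this paper). Because a unimodal score admits ternary search, the minimizing k is found with far fewer score evaluations than fitting all k_max models:

```python
# Sketch of searching a unimodal penalized score (our illustration).
# Each call to penalized_score stands in for fitting one model of
# complexity k and computing its penalized goodness-of-fit.

calls = 0

def penalized_score(k):
    """Hypothetical unimodal penalized score, minimized near k ~ 6.3."""
    global calls
    calls += 1
    return 100.0 / k + 2.5 * k

def argmin_unimodal(score, lo, hi):
    """Ternary search for the minimizer of a unimodal score on integers."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if score(m1) <= score(m2):
            hi = m2              # the minimizer cannot lie right of m2
        else:
            lo = m1              # the minimizer cannot lie left of m1
    return min(range(lo, hi + 1), key=score)

best_k = argmin_unimodal(penalized_score, 1, 50)   # k_max = 50
```

With k_max = 50 this needs on the order of 20 score evaluations rather than 50 model fits; the gap widens logarithmically as k_max grows.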
Note that the results described here do not have an obvious extension to non-linear models (such as feed-forward neural networks) or to loss functions such as the 0/1 loss for classification. \n\nReferences \n\nBishop, C., Neural Networks for Pattern Recognition, Oxford University Press, 1995, pp. 376-377. \n\nCadez, I., D. Heckerman, C. Meek, P. Smyth, and S. White, 'Visualization of navigation patterns on a Web site using model-based clustering,' Technical Report MS-TR-00-18, Microsoft Research, Redmond, WA, 2000. \n\nLi, Jonathan Q., and Barron, Andrew R., 'Mixture density estimation,' presented at NIPS 99. \n\nSmyth, P., K. Ide, and M. Ghil, 'Multiple regimes in Northern Hemisphere height fields via mixture model clustering,' Journal of the Atmospheric Sciences, vol. 56, no. 21, pp. 3704-3723, 1999. \n\n\f", "award": [], "sourceid": 1865, "authors": [{"given_name": "Igor", "family_name": "Cadez", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}