{"title": "The Use of MDL to Select among Computational Models of Cognition", "book": "Advances in Neural Information Processing Systems", "page_first": 38, "page_last": 44, "abstract": null, "full_text": "The Use of MDL to Select among Computational Models of Cognition\n\nIn J. Myung, Mark A. Pitt & Shaobo Zhang\nDepartment of Psychology\nOhio State University\nColumbus, OH 43210\n{myung.1, pitt.2}@osu.edu\n\nVijay Balasubramanian\nDavid Rittenhouse Laboratories\nUniversity of Pennsylvania\nPhiladelphia, PA 19103\nvijay@endiv.hep.upenn.edu\n\nAbstract\n\nHow should we decide among competing explanations of a cognitive process given limited observations? The problem of model selection is at the heart of progress in cognitive science. In this paper, Minimum Description Length (MDL) is introduced as a method for selecting among computational models of cognition. We also show that differential geometry provides an intuitive understanding of what drives model selection in MDL. Finally, the adequacy of MDL is demonstrated in two areas of cognitive modeling.\n\n1 Model Selection and Model Complexity\n\nThe development and testing of computational models of cognitive processing are a central focus in cognitive science. A model embodies a solution to a problem whose adequacy is evaluated by its ability to mimic behavior by capturing the regularities underlying observed data. This enterprise of model selection is challenging because of the competing goals that must be satisfied. Traditionally, computational models of cognition have been compared using one of many goodness-of-fit measures. However, use of such a measure can result in the choice of a model that over-fits the data, one that captures idiosyncrasies in the particular data set (i.e., noise) over and above the underlying regularities of interest. 
Such models are considered complex, in that the inherent flexibility of the model enables it to fit diverse patterns of data. As a group, they can be characterized as having many parameters that are combined in a highly nonlinear fashion in the model equation. They do not assume a single structure in the data. Rather, the model contains multiple structures, each obtained by finely tuning the parameter values of the model, and thus can fit a wide range of data patterns. In contrast, simple models, frequently with few parameters, assume a specific structure in the data, which will manifest itself as a narrow range of similar data patterns. Only when one of these patterns occurs will the model fit the data well.\n\nThe problem of over-fitting data due to model complexity suggests that the goal of model selection should instead be to select the model that generalizes best to all data samples that arise from the same underlying regularity, thus capturing only the regularity, not the noise. To achieve this goal, the selection method must be sensitive to the complexity of a model. There are at least two independent dimensions of model complexity: the number of free parameters of a model and its functional form, which refers to the way the parameters are combined in the model equation. For instance, it seems unlikely that the two one-parameter models y = θx and y = x^θ are equally complex in their ability to fit data. The two dimensions of model complexity (number of parameters and functional form) and their interplay can improve a model's fit to the data, without necessarily improving generalizability.\n\nThe trademark of a good model selection procedure, then, is its ability to satisfy two opposing goals. A model must be sufficiently complex to describe the data sample accurately, but without over-fitting the data and thus losing generalizability. 
To achieve this end, we need a theoretically well-justified measure of model complexity that takes into account the number of parameters and the functional form of a model. In this paper, we introduce Minimum Description Length (MDL) as an appropriate method of selecting among mathematical models of cognition. We also show that MDL has an elegant geometric interpretation that provides a clear, intuitive understanding of the meaning of complexity in MDL. Finally, application examples of MDL are presented in two areas of cognitive modeling.\n\n1.1 Minimum Description Length\n\nThe central thesis of model selection is the estimation of a model's generalizability. One approach to assessing generalizability is the Minimum Description Length (MDL) principle [1]. It provides a theoretically well-grounded measure of complexity that is sensitive to both dimensions of complexity and also lends itself to intuitive, geometric interpretations. MDL was developed within algorithmic coding theory to choose the model that permits the greatest compression of data. A model family f with parameters θ assigns the likelihood f(y|θ) to a given set of observed data y. The full form of the MDL measure for such a model family is given below:\n\nMDL = -ln f(y|θ̂) + (k/2) ln(N/2π) + ln ∫ dθ √det I(θ)\n\nwhere θ̂ is the parameter value that maximizes the likelihood, k is the number of parameters in the model, N is the sample size, and I(θ) is the Fisher information matrix. MDL is the length in bits of the shortest possible code that describes the data with the help of a model. In the context of cognitive modeling, the model that minimizes MDL uncovers the greatest amount of regularity (i.e., knowledge) underlying the data and therefore should be selected. The first, maximized log likelihood term is the lack-of-fit measure, and the second and third terms constitute the intrinsic complexity of the model. 
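As a concrete numerical illustration (ours, not part of the original paper), all three terms of the MDL criterion can be computed exactly for a one-parameter Bernoulli model of y successes in N trials, whose per-observation Fisher information is I(θ) = 1/(θ(1-θ)), so that the volume integral ∫√I(θ) dθ over (0, 1) equals π:

```python
import numpy as np
from scipy import integrate

# Hypothetical sketch (not from the paper): MDL for a one-parameter
# Bernoulli model, MDL = -ln f(y|theta_hat) + (k/2) ln(N/2pi) + ln V,
# where V = integral of sqrt(det I(theta)) over the parameter range.

def mdl_bernoulli(y, N):
    theta_hat = y / N  # maximum-likelihood estimate (assume 0 < y < N)
    # Lack-of-fit term: negative maximized log likelihood.
    lack_of_fit = -(y * np.log(theta_hat) + (N - y) * np.log(1 - theta_hat))
    k = 1  # number of free parameters
    dimension_term = (k / 2) * np.log(N / (2 * np.pi))
    # Riemannian volume of the parameter manifold, computed numerically;
    # for the Bernoulli model this integral equals pi.
    volume, _ = integrate.quad(lambda t: 1.0 / np.sqrt(t * (1 - t)), 0.0, 1.0)
    return lack_of_fit + dimension_term + np.log(volume)

print(mdl_bernoulli(10, 20))  # ≈ 15.59
```

With natural logarithms the code lengths come out in nats; dividing by ln 2 converts them to bits.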
In particular, the third term captures the effects of complexity due to functional form, reflected through I(θ). We will call the latter two terms together the geometric complexity of the model, for reasons that will become clear in the remainder of this paper.\n\nMDL arises as a finite series of terms in an asymptotic expansion of the Bayesian posterior probability of a model given the data, for a special form of the parameter prior density [2]. Hence, in essence, minimization of MDL is equivalent to maximization of the Bayesian posterior probability. In this paper we present a geometric interpretation of MDL, as well as of Bayesian model selection [3], that provides an elegant and intuitive framework for understanding model complexity, a central concept in model selection.\n\n2 Differential Geometric Interpretation of MDL\n\nFrom a geometric perspective, a parametric model family of probability distributions forms a Riemannian manifold embedded in the space of all probability distributions [4]. Every distribution is a point in this space, and the collection of points created by varying the parameters of the model gives rise to a hyper-surface in which \"similar\" distributions are mapped to \"nearby\" points. The infinitesimal distance between points separated by the infinitesimal parameter differences dθ_i is given by ds² = Σ_{i,j=1}^k g_ij(θ) dθ_i dθ_j, where g_ij(θ) is the Riemannian metric tensor. The Fisher information, I_ij(θ), is the natural metric on a manifold of distributions in the context of statistical inference [4]. We argue that the MDL measure of model fitness has an attractive interpretation in such a geometric context.\n\nThe first term in MDL estimates the accuracy of the model, since the likelihood f(y|θ̂) measures the ability of the model to fit the observed data. The second and third terms are supposed to penalize model complexity; we will show that they have interesting geometric interpretations. Given the metric g_ij = I_ij on the space of parameters, the infinitesimal volume element on the parameter manifold is dV = dθ √det I(θ), where dθ ≡ Π_{i=1}^k dθ_i. The Riemannian volume of the parameter manifold is obtained by integrating dV over the space of parameters:\n\nV_M = ∫ dV = ∫ dθ √det I(θ)\n\nIn other words, the third term in MDL penalizes models that occupy a large volume in the space of distributions.\n\nIn fact, the volume measure V_M is related to the number of \"distinguishable\" probability distributions indexed by the model M.¹ Because of the way the model family is embedded in the space of distributions, two different parameter values can index very similar distributions. If complexity is related to volumes occupied by model manifolds, the measure of volume should count only different, or distinguishable, distributions, and not the artificial coordinate volume. It is shown in [2,5] that the volume V_M achieves this goal.²\n\nWhile the third term in MDL measures the total volume of distributions a model can describe, the second term relates to the number of model distributions that lie close to the truth. To see this, taking a Bayesian perspective on model selection is helpful. Using Bayes rule, the probability that the truth lies in the family f given the observed data y can be written as:\n\nPr(f|y) = A(f,y) ∫ dθ w(θ) Pr(y|θ)\n\nHere w(θ) is the prior probability of the parameter θ, and A(f,y) = Pr(f)/Pr(y) is the ratio of the prior probabilities of the family f and data y. Bayesian methods assume that the latter are the same for all models under consideration and analyze the so-called Bayesian posterior\n\nP_f = ∫ dθ w(θ) Pr(y|θ)\n\nLacking prior knowledge, w should be chosen to weight all distinguishable distributions in the family equally. 
Hence, w(θ) = 1/V_M. For large sample sizes, the likelihood function f(y|θ) localizes under general conditions to a multivariate Gaussian centered at the maximum likelihood parameter θ* (see [3,4] and citations therein). In this limit, the integral for P_f can be explicitly carried out. Performing the integral and taking a log gives the result\n\n-ln P_f = -ln f(y|θ*) + ln(V_M / C_M) + O(1/N), where C_M = (2π/N)^{k/2} h(θ*)\n\nwhere h(θ*) is a data-dependent factor that goes to 1 for large N when the truth lies within f (see [3,4] for details). C_M is essentially the volume of an ellipsoidal region around the Gaussian peak at f(y|θ*) where the integrand of the Bayesian posterior makes a substantial contribution. In effect, C_M measures the number of distinguishable distributions within f that lie close to the truth.\n\nUsing the expressions for C_M and V_M, the MDL selection criterion can be written as\n\nMDL = -ln f(y|θ*) + ln(V_M / C_M) + terms subleading in N\n\n(The subleading terms include the contribution of h(θ*); see [3,4] regarding its role in Bayesian inference.) The geometric meaning of the complexity penalty in MDL now becomes clear: models which occupy a relatively large volume distant from the truth are penalized, while models that contain a relatively large fraction of distributions lying close to the truth are preferred. Therefore, we refer to the last two terms in MDL as geometric complexity.\n\n¹ Roughly speaking, two probability distributions are considered indistinguishable if one is mistaken for the other even in the presence of an infinite amount of data. A careful definition of distinguishability involves use of the Kullback-Leibler distance between two probability distributions. For further details, see [3,4].\n² Note that the parameters of the model are always assumed to be cut off in a manner that ensures that V_M is finite. 
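The notion of distinguishability invoked in footnote 1 can be made concrete with a small numeric sketch (our illustration, not the authors'): nearby Bernoulli parameters index distributions whose Kullback-Leibler distance is tiny, while well-separated parameters are easily told apart, and for small separations the KL distance is governed by the Fisher information metric of the main text.

```python
import numpy as np

def kl_bernoulli(p, q):
    """Kullback-Leibler distance D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Nearby parameter values index nearly indistinguishable distributions;
# distant values are easy to tell apart even with modest samples.
print(kl_bernoulli(0.50, 0.51))  # tiny
print(kl_bernoulli(0.50, 0.90))  # large

# For small separations d, KL ≈ I(theta) * d^2 / 2, with Fisher information
# I(theta) = 1/(theta(1-theta)) -- the metric g_ij = I_ij of the main text.
d = 0.01
approx = (1.0 / (0.5 * 0.5)) * d**2 / 2
```

The quadratic approximation is the reason the Fisher information serves as the natural metric: it measures how quickly distributions become distinguishable as the parameter moves.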
It is also illuminating to collect terms in MDL as\n\nMDL = -ln[ f(y|θ*) / (V_M/C_M) ] = -ln(\"normalized maximized likelihood\")\n\nWritten this way, MDL selects the model that gives the highest value of the maximum likelihood normalized by the relative ratio of distinguishable distributions (V_M/C_M). From this perspective, a better model is simply one with many distinguishable distributions close to the truth, but few distinguishable distributions overall.\n\n3 Application Examples\n\nGeometric complexity and MDL constitute a powerful pair of model evaluation tools. When used together in model testing, a deeper understanding of the relationship between models can be gained. The first measure enables one to assess the relative complexities of the set of models under consideration. The second builds on the first by suggesting which model is preferable given the data in hand. The following simulations demonstrate the application of these methods in two areas of cognitive modeling: information integration and categorization. In each example, two competing models were fitted to artificial data sets generated by each model. Of interest is the ability of a selection method to recover the model that generated the data. MDL is compared with two other selection methods, both of which consider the number of parameters only: the Akaike Information Criterion (AIC; [6]) and the Bayesian Information Criterion (BIC; [7]), defined as\n\nAIC = -2 ln f(y|θ*) + 2k;  BIC = -2 ln f(y|θ*) + k ln N.\n\n3.1 Information Integration\n\nIn a typical information integration experiment, a range of stimuli are generated from a factorial manipulation of two or more stimulus dimensions (e.g., visual and auditory) and then presented to participants for categorization as one of two or more possible response alternatives. Data are scored as the proportion of responses in one category across the various combinations of stimulus dimensions. For this comparison, we consider two models of information integration, the Fuzzy Logical Model of Perception (FLMP; [8]) and the Linear Integration Model (LIM; [9]). Each assumes that the response probability (p_ij) of one category, say A, upon the presentation of a stimulus of the specific i and j feature dimensions in a two-factor information integration experiment takes the following form:\n\nFLMP: p_ij = θ_i λ_j / (θ_i λ_j + (1-θ_i)(1-λ_j));  LIM: p_ij = (θ_i + λ_j) / 2\n\nwhere θ_i and λ_j (i = 1,...,q1; j = 1,...,q2; 0 < θ_i, λ_j < 1) are parameters representing the corresponding feature dimensions. The simulation results are shown in Table 1.\n\nWhen the data were generated by FLMP, regardless of the selection method used, FLMP was recovered 100% of the time. This was true across all selection methods and across both sample sizes, except for MDL when sample size was 20. In this case, MDL did not perform quite as well as the other selection methods. When the data were generated by LIM, AIC or BIC fared much more poorly, whereas MDL recovered the correct model (LIM) across both sample sizes. Specifically, under AIC or BIC, FLMP was selected over LIM half of the time for N = 20 (51% vs. 49%), though such errors were reduced for N = 150 (17% vs. 83%).\n\nTable 1: Model Recovery Rates for Two Information Integration Models\n\nSample size   Method    Model fitted   Data from FLMP   Data from LIM\nN = 20        AIC/BIC   FLMP           100%             51%\n              AIC/BIC   LIM            0%               49%\n              MDL       FLMP           89%              0%\n              MDL       LIM            11%              100%\nN = 150       AIC/BIC   FLMP           100%             17%\n              AIC/BIC   LIM            0%               83%\n              MDL       FLMP           100%             0%\n              MDL       LIM            0%               100%\n\nThat FLMP is selected over LIM when a method such as AIC was used, even when the data were generated by LIM, suggests that FLMP is more complex than LIM. 
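The two model equations above can be implemented directly for a small two-factor design; the following sketch is our own illustration (function and variable names are not from the paper):

```python
import numpy as np

def flmp(theta, lam):
    """Fuzzy Logical Model of Perception: multiplicative feature integration."""
    t, l = np.meshgrid(theta, lam, indexing="ij")
    return t * l / (t * l + (1 - t) * (1 - l))

def lim(theta, lam):
    """Linear Integration Model: averaging of feature values."""
    t, l = np.meshgrid(theta, lam, indexing="ij")
    return (t + l) / 2

# A hypothetical 2 x 2 factorial design.
theta = np.array([0.2, 0.8])  # levels of the first stimulus dimension
lam = np.array([0.4, 0.6])    # levels of the second stimulus dimension
print(flmp(theta, lam))       # predicted response probabilities, FLMP
print(lim(theta, lam))        # predicted response probabilities, LIM
```

Both models have q1 + q2 free parameters, so any difference in their flexibility must come from functional form alone, which is exactly the dimension of complexity that AIC and BIC ignore.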
This observation was confirmed when the geometric complexity of each model was calculated. The difference in geometric complexity between FLMP and LIM was 8.74, meaning that for every distinguishable distribution for which LIM can account, FLMP can describe about e^8.74 ≈ 6248 distinguishable distributions. Obviously, this difference in complexity between the two models must be due to the functional form, because they have the same number of parameters.\n\n3.2 Categorization\n\nTwo models of categorization were considered in the present demonstration: the generalized context model (GCM; [10]) and the prototype model (PRT; [11]). Each model assumes that categorization responses follow a multinomial probability distribution with p_iJ (the probability of a category C_J response given stimulus X_i), which is given by\n\nGCM: p_iJ = Σ_{j ∈ C_J} s_ij / Σ_K Σ_{k ∈ C_K} s_ik ;  PRT: p_iJ = s_iJ / Σ_K s_iK\n\nIn the equation, s_ij is a similarity measure between multidimensional stimuli X_i and X_j, and s_iJ is a similarity measure between stimulus X_i and the prototypic stimulus X_J of category C_J. Similarity is measured using the Minkowski distance metric with the metric parameter r. The two models were fitted to data sets generated by each model using the six-dimensional scaling solution from Experiment 1 of [12] under the Euclidean distance metric of r = 2.\n\nAs shown in Table 2, under AIC or BIC, a relatively small bias toward choosing GCM was found using data generated from PRT when N = 20. When MDL was used to choose between the two models, there was improvement over AIC in correcting the bias. In the larger sample size condition, there was no difference in model recovery rate between AIC and MDL. This outcome contrasts with that of the preceding example, in which MDL was generally superior to the other selection methods when sample size was smallest. 
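The two categorization rules above can be sketched in a few lines of code (our own illustration; the similarity values are hypothetical and assumed to be precomputed from Minkowski distances):

```python
import numpy as np

def gcm_prob(s, labels, target):
    """GCM: summed similarity of each stimulus to the exemplars of the target
    category, normalized by summed similarity to all exemplars."""
    return s[:, labels == target].sum(axis=1) / s.sum(axis=1)

def prt_prob(s_proto, target):
    """PRT: similarity of each stimulus to the target category's prototype,
    normalized by similarity to all category prototypes."""
    return s_proto[:, target] / s_proto.sum(axis=1)

# Hypothetical similarities: 2 stimuli x 3 exemplars, with exemplar labels.
s = np.array([[0.9, 0.8, 0.10],
              [0.2, 0.3, 0.95]])
labels = np.array([0, 0, 1])
print(gcm_prob(s, labels, 0))   # P(category 0 | stimulus) under GCM

# Hypothetical similarities of the 2 stimuli to the two category prototypes.
s_proto = np.array([[0.85, 0.10],
                    [0.25, 0.90]])
print(prt_prob(s_proto, 0))     # P(category 0 | stimulus) under PRT
```

The structural similarity of the two rules (a ratio of summed similarities) foreshadows the finding below that their geometric complexities differ only slightly.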
Table 2: Model Recovery Rates for Two Categorization Models\n\nSample size   Method    Model fitted   Data from GCM   Data from PRT\nN = 20        AIC/BIC   GCM            98%             15%\n              AIC/BIC   PRT            2%              85%\n              MDL       GCM            96%             7%\n              MDL       PRT            4%              93%\nN = 150       AIC/BIC   GCM            99%             1%\n              AIC/BIC   PRT            1%              99%\n              MDL       GCM            99%             1%\n              MDL       PRT            1%              99%\n\nOn the face of it, these findings would suggest that MDL is not much better than the other selection methods. After all, what else could cause this result? The only circumstance in which such an outcome is predicted under MDL is when the functional forms of the two models are similar (recall that the models have the same number of parameters), thus minimizing the differential contribution of functional form in the complexity term. Calculation of the geometric complexity of each model confirmed this suspicion. GCM is indeed only slightly more complex than PRT, the difference being equal to 0.60, so GCM can describe about two distributions (e^0.60 ≈ 1.8) for every distribution PRT can describe.\n\nThese simulation results together demonstrate the usefulness of MDL and the geometric complexity measure in testing models of cognition. MDL's sensitivity to functional form was clearly demonstrated in its superior model recovery rate, especially when the complexities of the models differed by a nontrivial amount.\n\n4 Conclusion\n\nModel selection in cognitive science can proceed far more confidently with a clear understanding of why one model should be preferred over another. A geometric interpretation of MDL helps to achieve this goal. The work carried out thus far indicates that MDL, along with the geometric complexity measure, holds considerable promise in evaluating computational models of cognition. 
MDL chooses the correct model most of the time, and geometric complexity provides a measure of how different the two models are in their capacity or power. Future work is directed toward extending this approach to other classes of models, such as connectionist networks.\n\nAcknowledgment and Author Note\n\nM.A.P. and I.J.M. were supported by NIMH Grant MH57472. V.B. was supported by the Society of Fellows and the Milton Fund of Harvard University, by NSF grant NSF-PHY-9802709, and by DOE grant DOE-FG02-95ER40893. The present work is based in part on [5] and [13].\n\nReferences\n\n[1] Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40-47.\n\n[2] Balasubramanian, V. (1997) Statistical inference, Occam's razor and statistical mechanics on the space of probability distributions. Neural Computation, 9, 349-368.\n\n[3] MacKay, D. J. C. (1992) Bayesian interpolation. Neural Computation, 4, 415-447.\n\n[4] Amari, S. I. (1985) Differential Geometrical Methods in Statistics. Springer-Verlag.\n\n[5] Myung, I. J., Balasubramanian, V., & Pitt, M. A. (2000) Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences USA, 97, 11170-11175.\n\n[6] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second International Symposium on Information Theory, pp. 267-281. Akademiai Kiado, Budapest.\n\n[7] Schwarz, G. (1978) Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.\n\n[8] Oden, G. C., & Massaro, D. W. (1978) Integration of featural information in speech perception. Psychological Review, 85, 172-191.\n\n[9] Anderson, N. H. (1981) Foundations of Information Integration Theory. Academic Press.\n\n[10] Nosofsky, R. M. (1986) Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.\n\n[11] Reed, S. K. (1972) Pattern recognition and categorization. Cognitive Psychology, 3, 382-407.\n\n[12] Shin, H. J., & Nosofsky, R. M. (1992) Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.\n\n[13] Pitt, M. A., Myung, I. J., & Zhang, S. (2000) Toward a method of selecting among computational models of cognition. Submitted for publication.", "award": [], "sourceid": 1897, "authors": [{"given_name": "In", "family_name": "Myung", "institution": null}, {"given_name": "Mark", "family_name": "Pitt", "institution": null}, {"given_name": "Shaobo", "family_name": "Zhang", "institution": null}, {"given_name": "Vijay", "family_name": "Balasubramanian", "institution": null}]}