{"title": "Measuring model complexity with the prior predictive", "book": "Advances in Neural Information Processing Systems", "page_first": 1919, "page_last": 1927, "abstract": "In the last few decades, model complexity has received a lot of press. While many methods have been proposed that jointly measure a model\u2019s descriptive adequacy and its complexity, few measures exist that measure complexity in itself. Moreover, existing measures ignore the parameter prior, which is an inherent part of the model and affects the complexity. This paper presents a stand alone measure for model complexity, that takes the number of parameters, the functional form, the range of the parameters and the parameter prior into account. This Prior Predictive Complexity (PPC) is an intuitive and easy to compute measure. It starts from the observation that model complexity is the property of the model that enables it to fit a wide range of outcomes. The PPC then measures how wide this range exactly is.", "full_text": "Measuring model complexity with the prior predictive\n\nWolf Vanpaemel \u2217\n\nDepartment of Psychology\n\nUniversity of Leuven\n\nwolf.vanpaemel@psy.kuleuven.be\n\nBelgium.\n\nAbstract\n\nIn the last few decades, model complexity has received a lot of press. While many\nmethods have been proposed that jointly measure a model\u2019s descriptive adequacy\nand its complexity, few measures exist that measure complexity in itself. More-\nover, existing measures ignore the parameter prior, which is an inherent part of the\nmodel and affects the complexity. This paper presents a stand alone measure for\nmodel complexity, that takes the number of parameters, the functional form, the\nrange of the parameters and the parameter prior into account. This Prior Predictive\nComplexity (PPC) is an intuitive and easy to compute measure. It starts from the\nobservation that model complexity is the property of the model that enables it to\n\ufb01t a wide range of outcomes. 
The PPC then measures how wide this range exactly\nis.\nkeywords: Model Selection & Structure Learning; Model Comparison Methods;\nPerception\n\nThe recent revolution in model selection methods in the cognitive sciences was driven to a large\nextent by the observation that computational models can differ in their complexity. Differences\nin complexity put models on unequal footing when their ability to approximate empirical data is\nassessed. Therefore, models should be penalized for their complexity when their adequacy is mea-\nsured. The balance between descriptive adequacy and complexity has been termed generalizability\n[1, 2].\nMuch attention has been devoted to developing, advocating, and comparing different measures of\ngeneralizability (for a recent overview, see [3]). In contrast, measures of complexity have received\nrelatively little attention. The aim of the current paper is to propose and illustrate a stand alone\nmeasure of model complexity, called the Prior Predictive Complexity (PPC). The PPC is based on\nthe intuitive idea that a complex model can predict many outcomes and a simple model can predict\na few outcomes only.\nFirst, I discuss existing approaches to measuring model complexity and note some of their limita-\ntions. In particular, I argue that currently existing measures ignore one important aspect of a model:\nthe prior distribution it assumes over the parameters. I then introduce the PPC, which, unlike the\nexisting measures, is sensitive to the parameter prior. Next, the PPC is illustrated by calculating the\ncomplexities of two popular models of information integration.\n\n1 Previous approaches to measuring model complexity\n\nA \ufb01rst approach to assess the (relative) complexity of models relies on simulated data. Simulation-\nbased methods differ in how these arti\ufb01cial data are generated. A \ufb01rst, atheoretical approach uses\nrandom data [4, 5]. 
In the semi-theoretical approach, the data are generated from some theoretically\n\n\u2217I am grateful to Michael Lee and Liz Bonawitz.\n\n1\n\n\finteresting functions, such as the exponential or the logistic function [4]. Using these approaches,\nthe models under consideration are equally complex if each model provides the best optimal \ufb01t to\nroughly the same number of data sets. A \ufb01nal approach to generating arti\ufb01cial data is a theoretical\none, in which the data are generated from the models of interest themselves [6, 7]. The parameter\nsets used in the generation can either be hand-picked by the researcher, estimated from empirical\ndata or drawn from a previously speci\ufb01ed distribution. If the models under consideration are equally\ncomplex, each model should provide the best optimal \ufb01t to self-generated data more often than the\nother models under consideration do.\nOne problem with this simulation-based approach is that it is very labor intensive. It requires gen-\nerating a large amount of arti\ufb01cial data sets, and \ufb01tting the models to all these data sets. Further,\nit relies on choices that are often made in an arbitrary fashion that nonetheless bias the results. For\nexample, in the semi-theoretical approach, a crucial choice is which functions to use. Similarly, in\nthe theoretical approach, results are heavily in\ufb02uenced by the parameter values used in generating\nthe data. If they are \ufb01xed, on what basis? If they are estimated from empirical data, from which\ndata? If they are drawn randomly, from which distribution? Further, a simulation study only gives a\nrough idea of complexity differences but provides no direct measure re\ufb02ecting the complexity.\nA number of proposals have been made to measure model complexity more directly. Consider a\nmodel M with k parameters, summarized in the parameter vector \u03b8 = (\u03b81, \u03b82, . . . , \u03b8k, ) which has a\nrange indicated by \u2126. 
Let d denote the data and p(d|θ, M) the likelihood. The most straightforward measure of model complexity is the parametric complexity (PC), which simply counts the number of parameters:\n\nPC = k. (1)\n\nPC is attractive as a measure of model complexity since it is very easy to calculate. Further, it has a direct and well understood relation to complexity: the more parameters, the more complex the model. It is included as the complexity term of several generalizability measures such as AIC [8] and BIC [9], and it is at the heart of the Likelihood Ratio Test.\nDespite this intuitive appeal, PC is not free from problems. One problem with PC is that it reflects only a single aspect of complexity. The parameter range and the functional form (the way the parameters are combined in the model equation) also influence a model's complexity, but these dimensions of complexity are ignored in PC [2, 6].\nA complexity measure that takes these three dimensions into account is provided by the geometric complexity (GC) measure, which is inspired by differential geometry [10]. In GC, complexity is conceptualized as the number of distinguishable probability distributions a model can generate. It is defined by\n\nGC = (k/2) ln(n/(2π)) + ln ∫Ω √(det I(θ|M)) dθ, (2)\n\nwhere n indicates the size of the data sample and I(θ|M) is the Fisher Information Matrix:\n\nIij(θ|M) = −Eθ [∂² ln p(d|θ, M) / ∂θi ∂θj]. (3)\n\nNote that I(θ|M) is determined by the likelihood function p(d|θ, M), which is in turn determined by the model equation. Hence GC is sensitive to the number of parameters (through k), the functional form (through I), and the range (through Ω). 
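As a concreteness check, Equation 2 can be evaluated numerically for a one-parameter Bernoulli model, whose per-observation Fisher information is I(θ) = 1/(θ(1 − θ)). This is an illustrative sketch of mine, not code from the paper; the grid size and the sample size n = 150 are arbitrary choices.

```python
import math

# Per-observation Fisher information for a Bernoulli model with success
# probability theta: I(theta) = 1 / (theta * (1 - theta)).
def sqrt_fisher(theta):
    return (theta * (1.0 - theta)) ** -0.5

# Midpoint rule over Omega = (0, 1); analytically this integral equals pi.
M = 200_000
integral = sum(sqrt_fisher((i + 0.5) / M) for i in range(M)) / M

def geometric_complexity(k, n, integral):
    # Equation 2: GC = (k / 2) ln(n / (2 pi)) + ln of the integral term
    return 0.5 * k * math.log(n / (2 * math.pi)) + math.log(integral)

print(round(integral, 3))  # close to pi
print(round(geometric_complexity(1, 150, integral), 3))
```

With k = 1 and this integral (which equals π analytically), GC grows logarithmically in the sample size n, as Equation 2 prescribes.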
Quite surprisingly, GC turns out to be equal to the complexity term used in one version of Minimum Description Length (MDL), a measure of generalizability developed within the domain of information theory [2, 11, 12, 13].\nGC contrasts favorably with PC, in the sense that it takes three dimensions of complexity into account rather than a single one. A major drawback of GC is that, unlike PC, it requires considerable technical sophistication to be computed, as it relies on the second derivative of the likelihood. A more important limitation of both PC and GC is that these measures are insensitive to yet another important dimension contributing to model complexity: the prior distribution over the model parameters. The relation between the parameter prior distribution and model complexity is discussed next.\n\n2 Model complexity and the parameter prior\n\nThe growing popularity of Bayesian methods in psychology has not only raised awareness that model complexity should be taken into account when testing models [6], it has also drawn attention to the fact that on many occasions, relevant prior information is available [14]. In Bayesian methods, there is room to incorporate this information in two different flavors: as a prior distribution over the models, or as a prior distribution over the parameters. Specifying a model prior is a daunting task, so almost invariably, the model prior is taken to be uniform (but see [15] for an exception). In contrast, information regarding the parameters is much easier to include, although still challenging (e.g., [16]).\nThere are two ways to formalize prior information about a model's parameters: using the parameter prior range (often referred to as simply the range) and using the parameter prior distribution (often referred to as simply the prior). The prior range indicates which parameter values are allowed and which are forbidden. 
The prior distribution indicates which parameter values are likely and which are unlikely. Models that share the same equation and the same range but differ in the prior distribution can be considered different models (or at least different model versions), just like models that share the same equation but differ in range are different model versions. Like the parameter prior range, the parameter prior distribution influences the model complexity. In general, a model with a vague parameter prior distribution is more complex than a model with a sharply peaked parameter prior distribution, much as a model with a broad-ranged parameter is more complex than the same model where the parameter is heavily restricted.\nTo drive home the point that the parameter prior should be considered when model complexity is assessed, consider the following “fair coin” model Mf and a “biased coin” model Mb. There is a clear intuitive complexity difference between these models: Mb is more complex than Mf. The most straightforward way to formalize these models is as follows, where ph denotes the probability of observing heads:\n\nph = 1/2 (4)\n\nfor model Mf, and the triplet of equations\n\nph = θ, 0 ≤ θ ≤ 1, p(θ) = 1 (5)\n\njointly define model Mb. The range forbids values smaller than 0 or greater than 1 because ph is a proportion. As Mf and Mb have a different number of parameters, both PC and GC, being sensitive to the number of parameters, pick up the difference in model complexity between the models.\nAlternatively, model Mf could be defined as follows:\n\nph = θ, 0 ≤ θ ≤ 1, p(θ) = δ(θ − 1/2), (6)\n\nwhere δ(x) is the Dirac delta. Note that the model formalized in Equation 6 is exactly identical to the model formalized in Equation 4. 
However, relying on the formulation of model Mf in Equation 6, PC and GC now judge Mf and Mb to be equally complex: both models share the same model equation (which implies they have the same number of parameters and the same functional form) and the same range for the parameter. Hence, PC and GC make an incorrect judgement of the complexity difference between both models. This misjudgement is a direct result of the insensitivity of these measures to the parameter prior. As models Mf and Mb have different prior distributions over their parameter, a measure sensitive to the prior would pick up the complexity difference between these models. Such a measure is introduced next.\n\n3 The Prior Predictive Complexity\n\nModel complexity refers to the property of the model that enables it to predict a wide range of data patterns [2]. The idea of the PPC is to measure how wide this range exactly is. A complex model can predict many outcomes, and a simple model can predict a few outcomes only. Model simplicity, then, refers to the property of placing restrictions on the possible outcomes: the greater the restrictions, the greater the simplicity.\nTo understand how model complexity is measured in the PPC, it is useful to think about the universal interval (UI) and the predicted interval (PI). The universal interval is the range of outcomes that could potentially be observed, irrespective of any model. For example, in an experiment with n binomial trials, it is impossible to observe fewer than zero successes, or more than n successes, so the range of possible outcomes is [0, n]. Similarly, the universal interval for a proportion is [0, 1]. 
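To connect the coin models from Section 2 to these intervals, their prior predictive distributions over a small hypothetical experiment can be computed directly. This is a sketch; the flip count n and grid size G are illustrative choices of mine, not values from the paper.

```python
import math

n = 10  # hypothetical number of coin flips

def binom_pmf(k, n, p):
    # Binomial probability of k heads in n flips with heads probability p
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin model Mf: theta is fixed at 1/2, so the prior predictive is
# simply Binomial(n, 1/2), with mass concentrated around k = n/2.
ppd_f = [binom_pmf(k, n, 0.5) for k in range(n + 1)]

# Biased coin model Mb: uniform prior over theta; average the likelihood
# over a fine grid of theta values (a crude stand-in for the integral).
G = 5_000
ppd_b = [sum(binom_pmf(k, n, (g + 0.5) / G) for g in range(G)) / G
         for k in range(n + 1)]

# Mb spreads its mass evenly: every outcome gets roughly 1/(n + 1).
print([round(p, 3) for p in ppd_b])
# Mf concentrates: the single outcome k = 5 already gets ~25% of the mass.
print(round(ppd_f[5], 3))
```

The vague-prior model Mb assigns non-negligible mass to every outcome, while Mf rules most outcomes out, which is exactly the complexity difference the PPC is meant to quantify.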
The predicted interval is the interval containing all outcomes the model predicts.\nAn intuitive way to gauge model complexity is then the cardinality of the predicted interval, relative to the cardinality of the universal interval, averaged over all m conditions or stimuli:\n\nPPC = (1/m) Σi=1,...,m |PIi| / |UIi|. (7)\n\nA key aspect of the PPC is deriving the predicted interval. For a parameterized likelihood-based model, prediction takes the form of a distribution over all possible outcomes for some future, yet-to-be-observed data d under some model M. This distribution is called the prior predictive distribution (ppd) and can be calculated using the law of total probability:\n\np(d|M) = ∫Ω p(d|θ, M) p(θ|M) dθ. (8)\n\nPredicting the probability of unseen future data d arising under the assumption that model M is true involves integrating the probability of the data for each of the possible parameter values, p(d|θ, M), as weighted by the prior probability of each of these values, p(θ|M).\nNote that the ppd relies on the number of parameters (through the number of integrals and the likelihood), the model equation (through the likelihood), and the parameter range (through Ω). Therefore, like GC, the PPC is sensitive to all these aspects. In contrast to GC, however, the ppd, and hence the PPC, also relies on the parameter prior.\nSince predictions are made probabilistically, virtually all outcomes will be assigned some prior weight. This implies that, in principle, the predicted interval equals the universal interval. However, for some outcomes the assigned weight will be extremely small. Therefore, it seems reasonable to restrict the predicted interval to the smallest interval that includes some predetermined amount of the prior mass. 
For example, the 95% predictive interval is defined by those outcomes with the highest prior mass that together make up 95% of the prior mass.\nAnalytical solutions to the integral defining the ppd are rarely available. Instead, one should rely on approximations to the ppd by drawing samples from it. In the current study, sampling was performed using WinBUGS [17, 18], a highly versatile, user-friendly, and freely available software package. It contains sophisticated and relatively general-purpose Markov Chain Monte Carlo (MCMC) algorithms to sample from any distribution of interest.\n\n4 An application example\n\nThe PPC is illustrated by comparing the complexity of two popular models of information integration, which attempt to account for how people merge potentially ambiguous or conflicting information from various sensorial sources to create subjective experience. These models either assume that the sources of information are combined additively (the Linear Integration Model; LIM; [19]) or multiplicatively (the Fuzzy Logical Model of Perception; FLMP; [20, 21]).\n\n4.1 Information integration tasks\n\nA typical information integration task exposes participants simultaneously to different sources of information and requires this combined experience to be identified in a forced-choice identification task. The presented stimuli are generated from a factorial manipulation of the sources of information by systematically varying the ambiguity of each of the sources. The relevant empirical data consist of, for each of the presented stimuli, the counts km of the number of times the mth stimulus was identified as one of the response alternatives, out of the tm trials on which it was presented.\nFor example, an experiment in phonemic identification could involve two phonemes to be identified, /ba/ and /da/, and two sources of information, auditory and visual. 
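Before filling in the concrete models, the sampling recipe just described can be sketched without WinBUGS, using plain Monte Carlo for a generic binomial setup. The priors, trial count, and number of draws below are illustrative assumptions, not the paper's setup.

```python
import random

random.seed(0)

def sample_ppd(n_trials, prior_sampler, draws=20_000):
    # Approximate the prior predictive distribution (Equation 8) by
    # ancestral sampling: draw theta from the prior, then data from the
    # binomial likelihood.
    counts = [0] * (n_trials + 1)
    for _ in range(draws):
        theta = prior_sampler()
        k = sum(random.random() < theta for _ in range(n_trials))
        counts[k] += 1
    return [c / draws for c in counts]

def predicted_interval_size(ppd, coverage=0.95):
    # Number of outcomes in the smallest highest-mass set reaching the
    # requested coverage: the predicted interval, PI.
    mass, size = 0.0, 0
    for p in sorted(ppd, reverse=True):
        mass += p
        size += 1
        if mass >= coverage:
            break
    return size

n = 20
ppd_vague = sample_ppd(n, random.random)                        # Uniform(0, 1) prior
ppd_peaked = sample_ppd(n, lambda: random.betavariate(50, 50))  # sharp prior near 1/2

# One-condition PPC (Equation 7 with m = 1): |PI| / |UI|, with |UI| = n + 1.
ppc_vague = predicted_interval_size(ppd_vague) / (n + 1)
ppc_peaked = predicted_interval_size(ppd_peaked) / (n + 1)
print(ppc_vague, ppc_peaked)  # the vague prior yields the larger (more complex) value
```

The vague prior makes nearly every outcome fall inside the predicted interval, while the sharply peaked prior shrinks the interval and hence the PPC, mirroring the fair- and biased-coin comparison above.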
Stimuli are created by crossing different levels of audible speech, varying between /ba/ and /da/, with different levels of visible speech, also varying between these alternatives. The resulting set of stimuli spans a continuum between the two syllables. The participant is then asked to listen and to watch the speaker, and based on this combined audiovisual experience, to identify the syllable as being either /ba/ or /da/. In the so-called expanded factorial design, not only bimodal stimuli (containing both auditory and visual information) but also unimodal stimuli (providing only a single source of information) are presented.\n\n4.2 Information integration models\n\nIn what follows, the formal description of the LIM and the FLMP is outlined for a design with two response alternatives (/da/ or /ba/) and two sources (auditory and visual), with I and J levels, respectively. In such a two-choice identification task, the counts km follow a Binomial distribution:\n\nkm ∼ Binomial(pm, tm), (9)\n\nwhere pm indicates the probability that the mth stimulus is identified as /da/.\n\n4.2.1 Model equation\n\nThe probability for the stimulus constructed with the ith level of the first source and the jth level of the second being identified as /da/ is computed according to the choice rule:\n\npij = s(ij, /da/) / (s(ij, /da/) + s(ij, /ba/)), (10)\n\nwhere s(ij, /da/) represents the overall degree of support for the stimulus to be /da/.\nThe sources of information are assumed to be evaluated independently, implying that different parameters are used for the different modalities. In the present example, the degree of auditory support for /da/ is denoted by ai (i = 1, . . . , I) and the degree of visual support for /da/ by bj (j = 1, . . . 
, J).\nWhen a unimodal stimulus is presented, the overall degree of support for each alternative is given by s(i∗, /da/) = ai and s(∗j, /da/) = bj, where the asterisk (∗) indicates the absence of information, implying that Equation 10 reduces to\n\npi∗ = ai and p∗j = bj. (11)\n\nWhen a bimodal stimulus is presented, the overall degree of support for each alternative is based on the integration or blending of both these sources. Hence, for bimodal stimuli, s(ij, /da/) = ai ⊗ bj, where the operator ⊗ denotes the combination of both sources. Hence, Equation 10 reduces to\n\npij = (ai ⊗ bj) / (ai ⊗ bj + (1 − ai) ⊗ (1 − bj)). (12)\n\nThe LIM assumes an additive combination, i.e., ⊗ = +, so Equation 12 becomes\n\npij = (ai + bj) / 2. (13)\n\nThe FLMP, in contrast, assumes a multiplicative combination, i.e., ⊗ = ×, so Equation 12 becomes\n\npij = aibj / (aibj + (1 − ai)(1 − bj)). (14)\n\n4.2.2 Parameter prior range and distribution\n\nEach level of auditory and visual support for /da/ (i.e., ai and bj, respectively) is associated with a free parameter, which implies that the FLMP and the LIM have an equal number of free parameters, I + J. Each of these parameters is constrained to satisfy 0 ≤ ai, bj ≤ 1.\nThe original formulations of the LIM and FLMP unfortunately left the parameter priors unspecified. However, an implicit assumption that has been commonly used is a uniform prior for each of the parameters. This assumption implicitly underlies classical and widely adopted methods for model evaluation using accounted percentage of variance or maximum likelihood:\n\nai ∼ Uniform(0, 1) and bj ∼ Uniform(0, 1) for i = 1, . . . , I; j = 1, . . . , J. (15)\n\nThe models relying on this set of uniform priors will be referred to as LIMu and FLMPu.\nNote that LIMu and FLMPu treat the different parameters as independent. 
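The two combination rules in Equations 13 and 14 are easy to state in code. A minimal sketch (the function names are mine, not from the paper):

```python
def lim(a_i, b_j):
    # Equation 13: additive (averaging) combination of the two sources
    return (a_i + b_j) / 2

def flmp(a_i, b_j):
    # Equation 14: multiplicative combination with renormalization
    return a_i * b_j / (a_i * b_j + (1 - a_i) * (1 - b_j))

# Two agreeing, fairly unambiguous sources: FLMP predicts a more extreme
# response probability than the LIM.
print(round(lim(0.9, 0.9), 3), round(flmp(0.9, 0.9), 3))

# Two maximally conflicting sources: both rules land near indifference.
print(round(lim(0.9, 0.1), 3), round(flmp(0.9, 0.1), 3))
```

The example illustrates why the FLMP is the more flexible rule: its multiplicative form pushes agreeing sources toward the extremes, which lets it reach a wider set of response probabilities.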
This approach misses important information. In particular, the experimental design is such that the amount of support for each level i + 1 is always higher than for level i. Because parameter ai (or bi) corresponds to the degree of auditory (or visual) support for a unimodal stimulus at the ith level, it seems reasonable to expect the following orderings among the parameters to hold (see also [6]):\n\naj > ai and bj > bi for j > i. (16)\n\nThe models relying on this set of ordered priors will be referred to as LIMo and FLMPo.\n\n4.3 Complexity and experimental design\n\nIt is tempting to consider model complexity as an inherent characteristic of a model. For some models and for some measures of complexity this is clearly the case. Consider, for example, model Mb. In any experimental design (i.e., any number of coin tosses), PCMb = 1. However, more generally, this is not the case. Focusing on the FLMP and the LIM, it is clear that even a simple measure such as PC depends crucially on (some aspects of) the experimental design. In particular, every level corresponds to a new parameter, so PC = I + J. Similarly, GC is dependent on design choices. The PPC is not different in this respect.\nThe design sensitivity implies that one can only make sensible conclusions about differences in model complexity by using different designs. In an information integration task, the design decisions include the type of design (expanded or not), the number of sources, the number of response alternatives, the number of levels for each source, and the number of observations for each stimulus (sample size). The present study focuses on the expanded factorial designs with two sources and two response alternatives. 
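One simple way to draw parameters that satisfy the order constraints of Equation 16 is to sort independent uniform draws. This is an illustrative construction of mine, not necessarily the sampling scheme used in the paper:

```python
import random

random.seed(1)

def ordered_uniform_draw(levels):
    # Draw `levels` independent Uniform(0, 1) values and sort them, so the
    # resulting vector respects a_1 < a_2 < ... < a_I by construction.
    return sorted(random.random() for _ in range(levels))

a = ordered_uniform_draw(8)  # e.g., auditory parameters in an 8-level design
b = ordered_uniform_draw(2)  # e.g., visual parameters in a 2-level design

print(all(x < y for x, y in zip(a, a[1:])))  # True: the ordering holds
```

Restricting the prior this way removes most of the prior mass that the independent uniform priors spread over non-monotone parameter configurations, which is exactly why LIMo and FLMPo come out simpler than LIMu and FLMPu.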
The additional design features were varied: both a 5 × 5 and an 8 × 2 design were considered, using three different sample sizes (20, 60 and 150, following [2]).\n\n4.4 Results\n\nFigure 1 shows the 99% predicted interval in the 8 × 2 design with n = 150. Each panel corresponds to a different model. In each panel, each of the 26 stimuli is displayed on the x-axis. The first eight stimuli correspond to the stimuli with the lowest level of visual support, and are ordered in increasing order of auditory support. The next eight stimuli correspond to the stimuli with the highest level of visual support. The next eight stimuli correspond to the unimodal stimuli where only auditory information is provided (again ranked in increasing order). The final two stimuli are the unimodal visual stimuli.\nPanel A shows that the predicted interval of LIMu nearly equals the universal interval, ranging between 0 and 1. This indicates that almost all outcomes are given a non-negligible prior mass by LIMu, making it almost maximally complex. FLMPu is even more complex. The predicted interval, shown in Panel B, virtually equals the universal interval, indicating that the model predicts virtually every possible outcome. Panels C and D show the dramatic effect of incorporating relevant prior information in the models. The predicted intervals of both LIMo and FLMPo are much smaller than their counterparts using the uniform priors.\nFocusing on the comparison between LIM and FLMP, the PPC indicates that the latter is more complex than the former. 
This observation holds irrespective of the model version (assuming uniform vs. ordered priors). The smaller complexity of LIM is in line with previous attempts to measure the relative complexities of LIM and FLMP, such as the atheoretical simulation-based approach ([4] but see [5]), the semi-theoretical simulation-based approach [4], the theoretical simulation-based approach [2, 6, 22], and a direct computation of the GC [2].\n\nFigure 1: The 99% predicted interval (y-axis: proportion of /da/ responses) for each of the 26 stimuli (x-axis) according to LIMu (Panel A), FLMPu (Panel B), LIMo (Panel C), and FLMPo (Panel D).\n\nTable 1: PPC, based on the 99% predicted interval, for four models across six different designs.\n\n        5 × 5               8 × 2\n        20    60    150     20    60    150\nLIMu    0.97  0.94  0.94    0.97  0.95  0.93\nFLMPu   1     1     0.99    1     1     0.99\nLIMo    0.75  0.67  0.66    0.77  0.69  0.64\nFLMPo   0.83  0.80  0.81    0.86  0.82  0.78\n\nThe PPCs for all six designs considered are displayed in Table 1. It shows that the observations made for the 8 × 2, n = 150 design hold across the five remaining designs as well: LIM is simpler than FLMP; and models assuming ordered priors are simpler than models assuming uniform priors. Note that these conclusions would not have been possible based on PC or GC. For PC, all four models have the same complexity. 
GC, in contrast, would detect complexity differences between\nLIM and FLMP (i.e., the \ufb01rst conclusion), but due to its insensitivity to the parameter prior, the\ncomplexity differences between LIMu and LIMo on the one hand, and FLMPu and FLMPo on the\nother hand (i.e., the second conclusion) would have gone unnoticed.\n\n5 Discussion\n\nA theorist de\ufb01ning a model should clearly and explicitly specify at least the three following pieces of\ninformation: the model equation, the parameter prior range, and the parameter prior distribution. If\nany of these pieces is missing, the model should be regarded as incomplete, and therefore untestable.\nConsequently, any measure of generalizability should be sensitive to all three aspects of the model\nde\ufb01nition. Many currently popular generalizability measures do not satisfy this criterion, including\nAIC, BIC and MDL. A measure of generalizability that does take these three aspects of a model into\naccount is the marginal likelihood [6, 7, 14, 23]. Often, the marginal likelihood is criticized exactly\nfor its sensitivity to the prior range and distribution (e.g., [24]). However, in the light of the fact that\nthe prior is a part of the model de\ufb01nition, I see the sensitivity of the marginal likelihood to the prior\nas an asset rather than a nuisance. It is precisely the measures of generalizability that are insensitive\nto the prior that miss an important aspect of the model.\nSimilarly, any stand alone measure of model complexity should be sensitive to all three aspects of the\nmodel de\ufb01nition, as all three aspects contribute to the model\u2019s complexity (with the model equation\ncontributing two factors: the number of parameters and the functional form). Existing measures of\ncomplexity do not satisfy this requirement and are therefore incomplete. 
PC takes only part of the\nmodel equation into account, whereas GC takes only the model equation and the range into account.\nIn contrast, the PPC currently proposed is sensitive to all these three aspects. It assesses model\ncomplexity using the predicted interval which contains all possible outcomes a model can generate.\nA narrow predicted interval (relative to the universal interval) indicates a simple model; a complex\nmodel is characterized by a wide predicted interval.\nThere is a tight coupling between the notions of information, knowledge and uncertainty, and the\nnotion of model complexity. As parameters correspond to unknown variables, having more in-\nformation available leads to fewer parameters and hence to a simpler model. Similarly, the more\ninformation there is available, the sharper the parameter prior, implying a simpler model. To put\nit differently, the less uncertainty present in a model, the narrower its predicted interval, and the\nsimpler the model. For example, in model Mb, there is maximal uncertainty. Nothing but the range\nis known about \u03b8, so all values of \u03b8 are equally likely. In contrast, in model Mf , there is minimal\nuncertainty. In fact, ph is known for sure, so only a single value of \u03b8 is possible. This difference in\nuncertainty is translated in a difference in complexity. The same is true for the information integra-\ntion models. Incorporating the order constraints in the priors reduces the uncertainty compared to\nthe models without these constraints (it tells you, for example, that parameter a1 is smaller than a2).\nThis reduction in uncertainty is re\ufb02ected by a smaller complexity.\nThere are many different sources of prior information that can be translated in a range or distribu-\ntion. The illustration using the information integration models highlighted that prior information\ncan re\ufb02ect meaningful information in the design. 
Alternatively, priors can be informed by previous\napplications of similar models in similar settings. Probably the purest form of priors are those that\ntranslate theoretical assumptions made by a model (see [16]). The fact that it is often dif\ufb01cult to for-\nmalize this prior information may not be used as an excuse to leave the prior unspeci\ufb01ed. Sure it is a\nchallenging task, but so is translating theoretical assumptions into the model equation. Formalizing\ntheory, intuitions, and information is what model building is all about.\n\n8\n\n\fReferences\n[1] Myung, I. J. (2000) The importance of complexity in model selection. Journal of Mathematical Psychol-\n\nogy, 44, 190\u2013204.\n\n[2] Pitt, M. A., Myung, I. J., and Zhang, S. (2002) Toward a method of selecting among computational models\n\nof cognition. Psychological Review, 109, 472\u2013491.\n\n[3] Shiffrin, R. M., Lee, M. D., Kim, W., and Wagenmakers, E. J. (2008) A survey of model evaluation\n\napproaches with a tutorial on hierarchical Bayesian methods. Cognitive Science, 32, 1248\u20131284.\n\n[4] Cutting, J. E., Bruno, N., Brady, N. P., and Moore, C. (1992) Selectivity, scope, and simplicity of models:\nA lesson from \ufb01tting judgments of perceived depth. Journal of Experimental Psychology: General, 121,\n364\u2013381.\n\n[5] Dunn, J. (2000) Model complexity: The \ufb01t to random data reconsidered. Psychological Research, 63,\n\n174\u2013182.\n\n[6] Myung, I. J. and Pitt, M. A. (1997) Applying Occam\u2019s razor in modeling cognition: A Bayesian approach.\n\nPsychonomic Bulletin & Review, 4, 79\u201395.\n\n[7] Vanpaemel, W. and Storms, G. (in press) Abstraction and model evaluation in category learning. Behavior\n\nResearch Methods.\n\n[8] Akaike, H. (1973) Information theory and an extension of the maximum likelihood principle. Petrov, B.\nand Csaki, B. (eds.), Second International Symposium on Information Theory, pp. 267\u2013281, Academiai\nKiado.\n\n[9] Schwarz, G. 
(1978) Estimating the dimension of a model. Annals of Statistics, 6, 461\u2013464.\n[10] Myung, I. J., Balasubramanian, V., and Pitt, M. A. (2000) Counting probability distributions: Differential\n\ngeometry and model selection. Proceedings of the National Academy of Sciences, 97, 11170\u201311175.\n\n[11] Lee, M. D. (2002) Generating additive clustering models with minimal stochastic complexity. Journal of\n\nClassi\ufb01cation, 19, 69\u201385.\n\n[12] Rissanen, J. (1996) Fisher information and stochastic complexity. IEEE Transactions on Information\n\nTheory, 42, 40\u201347.\n\n[13] Gr\u00a8unwald, P. (2000) Model selection based on minimum description length. Journal of Mathematical\n\nPsychology, 44, 133\u2013152.\n\n[14] Lee, M. D. and Wagenmakers, E. J. (2005) Bayesian statistical inference in psychology: Comment on\n\nTra\ufb01mow (2003). Psychological Review, 112, 662\u2013668.\n\n[15] Lee, M. D. and Vanpaemel, W. (2008) Exemplars, prototypes, similarities and rules in category represen-\n\ntation: An example of hierarchical Bayesian analysis. Cognitive Science, 32, 1403\u20131424.\n\n[16] Vanpaemel, W. and Lee, M. D. (submitted) Using priors to formalize theory: Optimal attention and the\n\ngeneralized context model.\n\n[17] Lee, M. D. (2008) Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin\n\n& Review, 15, 1\u201315.\n\n[18] Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2004) WinBUGS User Manual Version 2.0. Medi-\n\ncal Research Council Biostatistics Unit. Institute of Public Health, Cambridge.\n\n[19] Anderson, N. H. (1981) Foundations of information integration theory. Academic Press.\n[20] Oden, G. C. and Massaro, D. W. (1978) Integration of featural information in speech perception. Psycho-\n\nlogical Review, 85, 172\u2013191.\n\n[21] Massaro, D. W. (1998) Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT\n\nPress.\n\n[22] Massaro, D. W., Cohen, M. M., Campbell, C. 
S., and Rodriguez, T. (2001) Bayes factor of model selection\n\nvalidates FLMP. Psychonomic Bulletin and Review, 8, 1\u201317.\n\n[23] Kass, R. E. and Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association, 90,\n\n773\u2013795.\n\n[24] Liu, C. C. and Aitkin, M. (2008) Bayes factors: Prior sensitivity and model generalizability. Journal of\n\nMathematical Psychology, 53, 362\u2013375.\n\n9\n\n\f", "award": [], "sourceid": 1210, "authors": [{"given_name": "Wolf", "family_name": "Vanpaemel", "institution": null}]}