{"title": "Occam's Razor", "book": "Advances in Neural Information Processing Systems", "page_first": 294, "page_last": 300, "abstract": null, "full_text": "Occam\u00b7s Razor \n\nCarl Edward Rasmussen \n\nDepartment of Mathematical Modelling \n\nTechnical University of Denmark \n\nBuilding 321, DK-2800 Kongens Lyngby, Denmark \ncarl@imm . dtu . dk http : //bayes . imm . dtu . dk \n\nZoubin Ghahramani \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \n\n17 Queen Square, London WCIN 3AR, England \n\nzoubin@gatsby . ucl . ac . uk http : //www . g a tsby . ucl .ac . uk \n\nAbstract \n\nThe Bayesian paradigm apparently only sometimes gives rise to Occam's \nRazor; at other times very large models perform well. We give simple \nexamples of both kinds of behaviour. The two views are reconciled when \nmeasuring complexity of functions, rather than of the machinery used to \nimplement them. We analyze the complexity of functions for some linear \nin the parameter models that are equivalent to Gaussian Processes, and \nalways find Occam's Razor at work. \n\n1 Introduction \n\nOccam's Razor is a well known principle of \"parsimony of explanations\" which is influen(cid:173)\ntial in scientific thinking in general and in problems of statistical inference in particular. In \nthis paper we review its consequences for Bayesian statistical models, where its behaviour \ncan be easily demonstrated and quantified. One might think that one has to build a prior \nover models which explicitly favours simpler models. But as we will see, Occam's Razor is \nin fact embodied in the application of Bayesian theory. This idea is known as an \"automatic \nOccam's Razor\" [Smith & Spiegelhalter, 1980; MacKay, 1992; Jefferys & Berger, 1992]. \n\nWe focus on complex models with large numbers of parameters which are often referred to \nas non-parametric. 
We will use the term to refer to models in which we do not necessarily know the roles played by individual parameters, and inference is not primarily targeted at the parameters themselves, but rather at the predictions made by the models. These types of models are typical for applications in machine learning. \n\nFrom a non-Bayesian perspective, arguments are put forward for adjusting model complexity in the light of limited training data, to avoid over-fitting. Model complexity is often regulated by adjusting the number of free parameters in the model, and sometimes complexity is further constrained by the use of regularizers (such as weight decay). If the model complexity is either too low or too high, performance on an independent test set will suffer, giving rise to a characteristic Occam's Hill. Typically an estimator of the generalization error or an independent validation set is used to control the model complexity. \n\nFrom the Bayesian perspective, authors seem to take two conflicting stands on the question of model complexity. One view is to infer the probability of the model for each of several different model sizes and use these probabilities when making predictions. An alternate view suggests that we simply choose a \"large enough\" model and sidestep the problem of model size selection. Note that both views assume that parameters are averaged over. Example: should we use Occam's Razor to determine the optimal number of hidden units in a neural network, or should we simply use as many hidden units as is computationally feasible? We now describe these two views in more detail. \n\n1.1 View 1: Model size selection \n\nOne of the central quantities in Bayesian learning is the evidence, the probability of the data given the model, P(Y|M_i), computed as the integral over the parameters w of the likelihood times the prior. 
The evidence is related to the probability of the model, P(M_i|Y), through Bayes' rule: \n\nP(M_i|Y) = P(Y|M_i) P(M_i) / P(Y), \n\nwhere it is not uncommon that the prior on models P(M_i) is flat, such that P(M_i|Y) is proportional to the evidence. Figure 1 explains why the evidence discourages overcomplex models, and can be used to select[1] the most probable model. \n\nIt is also possible to understand how the evidence discourages overcomplex models and therefore embodies Occam's Razor by using the following interpretation. The evidence is the probability that if you randomly selected parameter values from your model class, you would generate data set Y. Models that are too simple will be very unlikely to generate that particular data set, whereas models that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random. \n\n1.2 View 2: Large models \n\nIn non-parametric Bayesian models there is no statistical reason to constrain models, as long as our prior reflects our beliefs. In fact, since constraining the model order (i.e. number of parameters) to some small number would not usually fit in with our prior beliefs about the true data generating process, it makes sense to use large models (no matter how much data you have) and pursue the infinite limit if you can[2]. For example, we ought not to limit the number of basis functions in function approximation a priori, since we don't really believe that the data was actually generated from a small number of fixed basis functions. Therefore, we should consider models with as many parameters as we can handle computationally. \n\nNeal [1996] showed how multilayer perceptrons with large numbers of hidden units achieved good performance on small data sets. He used sophisticated MCMC techniques to implement averaging over parameters. 
Following this line of thought there is no model complexity selection task: we don't need to evaluate the evidence (which is often difficult), and we don't need or want to use Occam's Razor to limit the number of parameters in our model. \n\n[1] We really ought to average together predictions from all models weighted by their probabilities. However, if the evidence is strongly peaked, or for practical reasons, we may want to select one as an approximation. \n\n[2] For some models, the limit of an infinite number of parameters is a simple model which can be treated tractably. Two examples are the Gaussian Process limit of Bayesian neural networks [Neal, 1996], and the infinite limit of Gaussian mixture models [Rasmussen, 2000]. \n\nFigure 1: Left panel: the evidence as a function of an abstract one-dimensional representation of \"all possible\" data sets. Because the evidence must \"normalize\", very complex models which can account for many data sets only achieve modest evidence; simple models can reach high evidences, but only for a limited set of data. When a data set Y is observed, the evidence can be used to select between model complexities. Such selection cannot be done using just the likelihood, P(Y|w, M_i). Right panel: neural networks with different numbers of hidden units form a family of models, posing the model selection problem. \n\n2 Linear in the parameters models - Example: the Fourier model \n\nFor simplicity, consider function approximation using the class of models that are linear in the parameters; this class includes many well known models such as polynomials, splines, kernel methods, etc: \n\ny(x) = \\sum_i w_i \\phi_i(x)  <=>  Y = w^T \\Phi, \n\nwhere y is the scalar output, w are the unknown weights (parameters) of the model, \\phi_i(x) are fixed basis functions, \\Phi_{in} = \\phi_i(x^{(n)}), and x^{(n)} is the (scalar or vector) input for example number n. 
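As a small illustrative sketch (not code from the paper; the helper name and the polynomial basis below are our own choices), a linear-in-the-parameters model reduces to a product of a design matrix of fixed basis functions with the weight vector:

```python
import numpy as np

def design_matrix(x, basis_funcs):
    """Stack the fixed basis functions phi_i evaluated at each input x^(n)."""
    return np.column_stack([phi(x) for phi in basis_funcs])

# Example: a cubic polynomial is linear in its four weights w.
basis = [lambda x: np.ones_like(x), lambda x: x,
         lambda x: x**2, lambda x: x**3]
x = np.linspace(-1.0, 1.0, 5)
w = np.array([0.5, -1.0, 0.0, 2.0])
y = design_matrix(x, basis) @ w   # predictions y^(n) = w^T Phi_n
```

Any fixed set of basis functions (sinusoids, splines, kernels centred on the data) can be substituted without changing the linear-in-w structure.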
For example, a Fourier model for scalar inputs has the form: \n\ny(x) = a_0 + \\sum_{d=1}^{D} [a_d \\sin(dx) + b_d \\cos(dx)], \n\nwhere w = \\{a_0, a_1, b_1, ..., a_D, b_D\\}. Assuming an independent Gaussian prior on the weights: \n\np(w|S, c) \\propto \\exp(-\\frac{S}{2} [c_0 a_0^2 + \\sum_{d=1}^{D} c_d (a_d^2 + b_d^2)]), \n\nwhere S is an overall scale and c_d are precisions (inverse variances) for weights of order (frequency) d. It is easy to show that Gaussian priors over weights imply Gaussian Process priors over functions[3]. The covariance function for the corresponding Gaussian Process prior is: \n\nK(x, x') = [\\sum_{d=0}^{D} \\cos(d(x - x')) / c_d] / S. \n\n[3] Under the prior, the joint density of any (finite) set of outputs y is Gaussian. \n\nFigure 2: Top: 12 different model orders (D = 0 to 11) for the \"unscaled\" model, c_d \\propto 1. The mean predictions are shown with a full line; the dashed and dotted lines limit the 50% and 95% central mass of the predictive distribution (which is Student-t). Bottom: posterior probability of the models, normalised over the 12 models. The probabilities of the models exhibit an Occam's Hill, discouraging models that are either \"too small\" or \"too big\". 
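Since the prior over the weights is Gaussian, random functions can be drawn from this model simply by sampling coefficients, as in the panels of figure 3. A minimal sketch (the constants and function name are our own; the paper's normalisation of the precisions is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fourier_function(D, gamma, S=1.0, n_grid=200):
    """Draw one random function from the Fourier prior with c_d = d**gamma.

    Weights a_d, b_d ~ N(0, 1/(S*c_d)); illustrative constants only.
    """
    d = np.arange(1, D + 1)
    c = d.astype(float) ** gamma
    a0 = rng.normal(0.0, np.sqrt(1.0 / S))
    a = rng.normal(0.0, np.sqrt(1.0 / (S * c)))
    b = rng.normal(0.0, np.sqrt(1.0 / (S * c)))
    x = np.linspace(-np.pi, np.pi, n_grid)
    # y(x) = a_0 + sum_d a_d sin(dx) + b_d cos(dx)
    y = a0 + np.sin(np.outer(x, d)) @ a + np.cos(np.outer(x, d)) @ b
    return x, y

# gamma = 0 gives rough, high-frequency draws; gamma = 4 gives smooth ones.
_, y_rough = sample_fourier_function(D=500, gamma=0)
_, y_smooth = sample_fourier_function(D=500, gamma=4)
```

Comparing the two draws makes the role of the precision scaling visible: the mean squared difference between neighbouring grid points is far larger for gamma = 0 than for gamma = 4.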
\n\n2.1 Inference in the Fourier model \n\nGiven data D = \\{x^{(n)}, y^{(n)} | n = 1, ..., N\\} with independent Gaussian noise with precision \\tau, the likelihood is: \n\np(Y|x, w, \\tau) \\propto \\prod_{n=1}^{N} \\exp(-\\frac{\\tau}{2} [y^{(n)} - w^T \\Phi_n]^2). \n\nFor analytical convenience, let the scale of the prior be proportional to the noise precision, S = C\\tau, and put vague[4] Gamma priors on \\tau and C: \n\np(\\tau) \\propto \\tau^{\\alpha_1 - 1} \\exp(-\\beta_1 \\tau),  p(C) \\propto C^{\\alpha_2 - 1} \\exp(-\\beta_2 C); \n\nthen we can integrate over weights and noise to get the evidence as a function of the prior hyperparameters C (the overall scale) and c (the relative scales): \n\nE(C, c) = \\int\\int p(Y|x, w, \\tau) p(w|C, \\tau, c) p(\\tau) p(C) d\\tau dw \n= \\frac{\\beta_1^{\\alpha_1} \\beta_2^{\\alpha_2} \\Gamma(\\alpha_1 + N/2)}{(2\\pi)^{N/2} \\Gamma(\\alpha_1) \\Gamma(\\alpha_2)} |A|^{-1/2} [\\beta_1 + \\frac{1}{2} Y^T (I - \\Phi A^{-1} \\Phi^T) Y]^{-\\alpha_1 - N/2} C^{D + \\alpha_2 - 1/2} \\exp(-\\beta_2 C) c_0^{1/2} \\prod_{d=1}^{D} c_d, \n\nwhere A = \\Phi^T \\Phi + C diag(\\tilde{c}), and the tilde indicates duplication of all components except for the first. \n\n[4] We choose vague priors by setting \\alpha_1 = \\alpha_2 = \\beta_1 = \\beta_2 = 0.2 throughout. \n\nWe can optimize[5] the overall scale C of the weights (using e.g. Newton's method). How do we choose the relative scales, c? The answer to this question turns out to be intimately related to the two different views of Bayesian inference. \n\nFigure 3: Functions drawn at random from the Fourier model with order D = 6 (dark) and D = 500 (light) for four different scalings (scaling exponents 0, 2, 3 and 4); limiting behaviour from left to right: discontinuous, Brownian, borderline smooth, smooth. \n\n2.2 Example \n\nTo illustrate the behaviour of this model we use data generated from a step function that changes from -1 to 1, corrupted by independent additive Gaussian noise with variance 0.25. 
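A toy data set of this kind can be generated along the following lines (the exact input locations and step position are our assumptions; the paper only specifies the general setup):

```python
import numpy as np

# Hypothetical reconstruction of the toy data: a step from -1 to 1 with
# Gaussian noise of variance 0.25, inputs in two lumps of 16 and 8 points,
# the step falling inside the larger lump. Locations are our own choice.
rng = np.random.default_rng(1)

x = np.concatenate([np.linspace(-1.0, 0.0, 16),   # larger lump, step inside it
                    np.linspace(0.7, 1.0, 8)])    # smaller lump
f_true = np.where(x < -0.5, -1.0, 1.0)            # the unrealisable step function
y = f_true + rng.normal(0.0, np.sqrt(0.25), size=x.shape)
```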
Note that the true function cannot be implemented exactly with a model of finite order, as would typically be the case in realistic modelling situations (the true function is not \"realizable\", or the model is said to be \"incomplete\"). The input points are arranged in two lumps of 16 and 8 points, the step occurring in the middle of the larger; see figure 2. \n\nIf we choose the scaling precisions to be independent of the frequency of the contributions, c_d \\propto 1 (while normalizing the sum of the inverse precisions), we achieve predictions as depicted in figure 2. We clearly see an Occam's Razor behaviour. A model order of around D = 6 is preferred. One might say that the limited data does not support models more complex than this. One way of understanding this is to note that as the model order grows, the prior parameter volume grows, but the relative posterior volume decreases, because parameters must be accurately specified in the complex model to ensure good agreement with the data. The ratio of posterior to prior volumes is the Occam Factor, which may be interpreted as a penalty to pay for fitting parameters. \n\nIn the present model, it is easy to draw functions at random from the prior by simply drawing values for the coefficients from their prior distributions. The left panel of figure 3 shows samples from the prior for the previous example for D = 6 and D = 500. With increasing order the functions get more and more dominated by high frequency components. In most modelling applications, however, we have some prior expectations about smoothness. By scaling the precision factors c_d we can ensure that the prior over functions converges to functions with particular characteristics as D grows towards infinity. Here we will focus on scalings of the form c_d = d^\\gamma for different values of \\gamma, the scaling exponent. 
As an example, if we choose the scaling c_d = d^3 we do not get an Occam's Razor in terms of the order of the model; see figure 4. Note that the predictions and their errorbars become almost independent of the model order as long as the order is large enough. Note also that the errorbars for these large models seem more reasonable than for D = 6 in figure 2 (where a spurious \"dip\" between the two lumps of data is predicted with high confidence). With this choice of scaling, it seems that the \"large models\" view is appropriate. \n\n[5] Of course, we ought to integrate over C, but unfortunately that is difficult. \n\nFigure 4: The same as figure 2 (model orders D = 0 to 11), except that the scaling c_d = d^3 was used here, leading to a prior which converges to smooth functions as D \\to \\infty. There is no Occam's Razor; instead we see that as long as the model is complex enough, the evidence is flat. We also notice that the predictive density of the model is unchanged as long as D is sufficiently large. \n\n3 Discussion \n\nIn the previous examples we saw that, depending on the scaling properties of the prior over parameters, both the Occam's Razor view and the large models view can seem appropriate. However, the example was unsatisfactory because it is not obvious how to choose the scaling exponent \\gamma. 
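The contrast between the two scalings can be reproduced qualitatively with a simplified evidence computation. The sketch below fixes the noise variance and the overall weight scale (the paper instead places Gamma priors on them and optimises C), so the constants and data are illustrative only:

```python
import numpy as np

def log_evidence(x, y, D, gamma, noise_var=0.25, scale=1.0):
    """Simplified log evidence of the Fourier model with c_d = d**gamma.

    With a Gaussian prior on w and Gaussian noise, the weights integrate
    out analytically: Y is marginally Gaussian with covariance
    Phi diag(var_w) Phi^T + noise_var * I. Fixing noise_var and scale is
    a simplification of the paper's treatment.
    """
    d = np.arange(1, D + 1)
    Phi = np.concatenate([np.ones((len(x), 1)),
                          np.sin(np.outer(x, d)),
                          np.cos(np.outer(x, d))], axis=1)
    # prior variances: 1 for a_0, then 1/c_d for each a_d and b_d
    var_w = scale * np.concatenate([[1.0], 1.0 / d**gamma, 1.0 / d**gamma])
    K = (Phi * var_w) @ Phi.T + noise_var * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet
                   + len(x) * np.log(2 * np.pi))

# Noisy step data, roughly in the spirit of the paper's example.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 24)
y = np.where(x < 0.0, -1.0, 1.0) + rng.normal(0.0, 0.5, size=x.shape)

ev_flat = [log_evidence(x, y, D, gamma=0.0) for D in range(12)]  # expected to show a hill
ev_d3 = [log_evidence(x, y, D, gamma=3.0) for D in range(12)]    # expected to flatten for large D
```

Plotting the two evidence curves against D mirrors the bottom panels of figures 2 and 4: with gamma = 3, orders beyond a moderate D contribute negligible prior variance, so the evidence barely changes as D grows.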
We can gain more insight into the meaning of \\gamma by analysing properties of functions drawn from the prior in the limit of large D. It is useful to consider the expected squared difference of outputs corresponding to nearby inputs, separated by \\Delta: \n\nG(\\Delta) = E[(f(x) - f(x + \\Delta))^2], \n\nin the limit as \\Delta \\to 0. In the table in figure 5 we have computed these limits for various values of \\gamma, together with the characteristics of these functions. For example, a property of smooth functions is that G(\\Delta) decreases as \\Delta^2 for small \\Delta. \n\nFigure 5: Left panel: the evidence as a function of the scaling exponent \\gamma and overall scale C has a maximum at \\gamma = 3. The table shows the characteristics of functions for different values of \\gamma; examples of these functions are shown in figure 3. \n\n\\gamma | lim_{D \\to \\infty} G(\\Delta) | properties \n0 | 1 | discontinuous \n2 | \\Delta | Brownian \n3 | \\Delta^2 (1 - ln \\Delta) | borderline smooth \n4 | \\Delta^2 | smooth \n\n4 Conclusion \n\nWe have reviewed the automatic Occam's Razor for Bayesian models, and seen how, while not necessarily penalising the number of parameters, this process is active in terms of the complexity of functions. Although we have only presented simplistic examples, the explanations of the behaviours rely on very basic principles that are generally applicable. Which of the two differing Bayesian views is most attractive depends on the circumstances: sometimes the large model limit may be computationally demanding; also, it may be difficult to analyse the scaling properties of priors for some models. On the other hand, in typical applications of non-parametric models, the \"large model\" view may be the most convenient way of expressing priors since, typically, we don't seriously believe that the \"true\" generative process can be implemented exactly with a small model. Moreover, optimizing (or integrating) over continuous hyperparameters may be easier than optimizing over the discrete space of model sizes. 
In the end, whichever view we take, Occam's Razor is always at work discouraging overcomplex models. \n\nAcknowledgements \n\nThis work was supported by the Danish Research Councils through the Computational Neural Network Center (CONNECT) and the THOR Center for Neuroinformatics. Thanks to Geoff Hinton for asking a puzzling question which stimulated the writing of this paper. \n\nReferences \n\nJefferys, W. H. & Berger, J. O. (1992) Ockham's Razor and Bayesian Analysis. Amer. Sci., 80:64-72. \n\nMacKay, D. J. C. (1992) Bayesian Interpolation. Neural Computation, 4(3):415-447. \n\nNeal, R. M. (1996) Bayesian Learning for Neural Networks, Lecture Notes in Statistics No. 118, New York: Springer-Verlag. \n\nRasmussen, C. E. (2000) The Infinite Gaussian Mixture Model, in S. A. Solla, T. K. Leen and K.-R. Müller (editors), Adv. Neur. Inf. Proc. Sys. 12, MIT Press, pp. 554-560. \n\nSmith, A. F. M. & Spiegelhalter, D. J. (1980) Bayes factors and choice criteria for linear models. J. Roy. Stat. Soc., 42:213-220. ", "award": [], "sourceid": 1925, "authors": [{"given_name": "Carl", "family_name": "Rasmussen", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}]}