{"title": "Catching Up Faster in Bayesian Model Selection and Model Averaging", "book": "Advances in Neural Information Processing Systems", "page_first": 417, "page_last": 424, "abstract": null, "full_text": "Catching Up Faster in Bayesian Model Selection and Model Averaging\n\nTim van Erven\n\nPeter Gr\u00fcnwald\n\nSteven de Rooij\n\nCentrum voor Wiskunde en Informatica (CWI)\n\nKruislaan 413, P.O. Box 94079\n\n1090 GB Amsterdam, The Netherlands\n\n{Tim.van.Erven,Peter.Grunwald,Steven.de.Rooij}@cwi.nl\n\nAbstract\n\nBayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the catch-up phenomenon as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian model averaging distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC-BIC dilemma. The method is practical; we give an efficient algorithm.\n\n1 Introduction\n\nWe consider inference based on a countable set of models (sets of probability distributions), focusing on two tasks: model selection and model averaging. In model selection tasks, the goal is to select the model that best explains the given data. In model averaging, the goal is to find the weighted combination of models that leads to the best prediction of future data from the same source.\n\nAn attractive property of some criteria for model selection is that they are consistent under weak conditions, i.e. 
if the true distribution P \u2217 is in one of the models, then the P \u2217-probability that this\nmodel is selected goes to one as the sample size increases. BIC [14], Bayes factor model selection\n[8], Minimum Description Length (MDL) model selection [3] and prequential model validation [5]\nare examples of widely used model selection criteria that are usually consistent. However, other\nmodel selection criteria such as AIC [1] and leave-one-out cross-validation (LOO) [16], while of-\nten inconsistent, do typically yield better predictions. This is especially the case in nonparametric\nsettings, where P \u2217 can be arbitrarily well-approximated by a sequence of distributions in the (para-\nmetric) models under consideration, but is not itself contained in any of these. In many such cases,\nthe predictive distribution converges to the true distribution at the optimal rate for AIC and LOO\n[15, 9], whereas in general BIC, the Bayes factor method and prequential validation only achieve\nthe optimal rate to within an O(log n) factor [13, 20, 6]. In this paper we reconcile these seemingly\ncon\ufb02icting approaches [19] by improving the rate of convergence achieved in Bayesian model se-\nlection without losing its convergence properties. First we provide an example to show why Bayes\nsometimes converges too slowly.\nGiven priors on models M1, M2, . . . and parameters therein, Bayesian inference associates each\nmodel Mk with the marginal distribution pk, given in (1), obtained by averaging over the parameters\naccording to the prior. In model selection the preferred model is the one with maximum a posteriori\nprobability. By Bayes\u2019 rule this is arg maxk pk(xn)w(k), where w(k) denotes the prior probability\nof Mk. We can further average over model indices, a process called Bayesian Model Averaging\n(BMA). The resulting distribution pbma(xn) = Pk pk(xn)w(k) can be used for prediction. 
In a sequential setting, the probability of a data sequence x^n := x_1, . . . , x_n under a distribution p typically decreases exponentially fast in n. It is therefore common to consider \u2212log p(x^n), which we call the codelength of x^n achieved by p. We take all logarithms to base 2, allowing us to measure codelength in bits. The name codelength refers to the correspondence between codelength functions and probability distributions based on the Kraft inequality, but one may also think of the codelength as the accumulated log loss that is incurred if we sequentially predict the x_i by conditioning on the past, i.e. using p(\u00b7|x^{i\u22121}) [3, 6, 5, 11]. For BMA, we have \u2212log p_bma(x^n) = \u2211_{i=1}^n \u2212log p_bma(x_i|x^{i\u22121}). Here the ith term represents the loss incurred when predicting x_i given x^{i\u22121} using p_bma(\u00b7|x^{i\u22121}), which turns out to be equal to the posterior average: p_bma(x_i|x^{i\u22121}) = \u2211_k p_k(x_i|x^{i\u22121}) w(k|x^{i\u22121}). Prediction using p_bma has the advantage that the codelength it achieves on x^n is close to the codelength of p_{k\u0302}, where k\u0302 is the index of the best of the marginals p_1, p_2, . . . Namely, given a prior w on model indices, the difference between \u2212log p_bma(x^n) = \u2212log(\u2211_k p_k(x^n) w(k)) and \u2212log p_{k\u0302}(x^n) must be in the range [0, \u2212log w(k\u0302)], whatever data x^n are observed. Thus, using BMA for prediction is sensible if we are satisfied with doing essentially as well as the best model under consideration. However, it is often possible to combine p_1, p_2, . . . into a distribution that achieves smaller codelength than p_{k\u0302}! This is possible if the index k\u0302 of the best distribution changes with the sample size in a predictable way. This is common in model selection, for example with nested models, say M_1 \u2282 M_2. 
In this case p1 typically predicts better at small sample sizes (roughly,\nbecause M2 has more parameters that need to be learned than M1), while p2 predicts better\neventually. Figure 1 illustrates this phenomenon. It shows the accumulated codelength difference\n\u2212 log p2(xn) \u2212 (\u2212 log p1(xn)) on \u201cThe Picture of Dorian Gray\u201d by Oscar Wilde, where p1 and p2\nare the Bayesian marginal distributions for the \ufb01rst-order and second-order Markov chains, respec-\ntively, and each character in the book is an outcome. Note that the example models M1 and M2 are\nvery crude; for this particular application much better models are available. In more complicated,\nmore realistic model selection scenarios, the models may still be wrong, but it may not be known\nhow to improve them. Thus M1 and M2 serve as a simple illustration only. We used uniform priors\non the model parameters, but for other common priors similar behaviour can be expected. Clearly\np1 is better for about the \ufb01rst 100 000 outcomes, gaining a head start of approximately 40 000 bits.\nIdeally we should predict the initial 100 000 outcomes using p1 and the rest using p2. However, pbma\nonly starts to behave like p2 when it catches up with p1 at a sample size of about 310 000, when the\ncodelength of p2 drops below that of p1. Thus, in the shaded area pbma behaves like p1 while p2 is\nmaking better predictions of those outcomes: since at n = 100 000, p2 is 40 000 bits behind, and at\nn = 310 000, it has caught up, in between it must have outperformed p1 by 40 000 bits!\nThe general pattern that \ufb01rst one model is\nbetter and then another occurs widely, both\non real-world data and in theoretical set-\ntings. 
We argue that failure to take this effect into account leads to the suboptimal rate of convergence achieved by Bayes factor model selection and related methods. We have developed an alternative method to combine distributions p_1 and p_2 into a single distribution p_sw, which we call the switch-distribution, defined in Section 2. Figure 1 shows that p_sw behaves like p_1 initially, but in contrast to p_bma it starts to mimic p_2 almost immediately after p_2 starts making better predictions; it essentially does this no matter what sequence x^n is actually observed. p_sw differs from p_bma in that it is based on a prior distribution on sequences of models rather than simply a prior distribution on models. This allows us to avoid the implicit assumption that there is one model which is best at all sample sizes. After conditioning on past observations, the posterior we obtain gives a better indication of which model performs best at the current sample size, thereby achieving a faster rate of convergence. Indeed, the switch-distribution is related to earlier algorithms for tracking the best expert developed in the universal prediction literature [7, 18, 17, 10]; however, the applications we have in mind and the theorems we prove are completely different. In Sections 3 and 4 we show that model selection based on the switch-distribution is consistent (Theorem 1), but unlike standard Bayes factor model selection achieves optimal rates of convergence (Theorem 2).\n\nFigure 1: The Catch-up Phenomenon. Horizontal axis: sample size. Vertical axis: codelength difference with Markov order 1 (bits). Curves: Markov order 2, Bayesian Model Averaging, Switch-Distribution.\n\n
Proofs of the theorems are in Appendix A. In Section 5 we give a practical algorithm that computes the switch-distribution for K (rather than 2) predictors in \u0398(n \u00b7 K) time. In the full paper, we will give further details of the proof of Theorem 1 and a more detailed discussion of Theorem 2 and the implications of both theorems.\n\n2 The Switch-Distribution for Model Selection and Prediction\n\nPreliminaries Suppose X^\u221e = (X_1, X_2, . . .) is a sequence of random variables that take values in sample space X \u2286 R^d for some d \u2208 Z+ = {1, 2, . . .}. For n \u2208 N = {0, 1, 2, . . .}, let x^n = (x_1, . . ., x_n) denote the first n outcomes of X^\u221e, such that x^n takes values in the product space X^n = X_1 \u00d7 \u00b7 \u00b7 \u00b7 \u00d7 X_n. (We let x^0 denote the empty sequence.) Let X^\u2217 = \u222a_{n=0}^\u221e X^n. For m > n, we write X^m_{n+1} for (X_{n+1}, . . ., X_m), where m = \u221e is allowed and we omit the subscript when n = 0.\n\nAny distribution P(X^\u221e) may be defined by a sequential prediction strategy p that predicts the next outcome at any time n \u2208 N. To be precise: Given the previous outcomes x^n at time n, this prediction strategy should issue a conditional density p(X_{n+1}|x^n) with corresponding distribution P(X_{n+1}|x^n) for the next outcome X_{n+1}. Such sequential prediction strategies are sometimes called prequential forecasting systems [5]. An instance is given in Example 1 below. We assume that the density p(X_{n+1}|x^n) is taken relative to either the usual Lebesgue measure (if X is continuous) or the counting measure (if X is countable). In the latter case p(X_{n+1}|x^n) is a probability mass function. 
It is natural to define the joint density p(x^m|x^n) = p(x_{n+1}|x^n) \u00b7 \u00b7 \u00b7 p(x_m|x^{m\u22121}) and let P(X^\u221e_{n+1}|x^n) be the unique distribution such that, for all m > n, p(X^m_{n+1}|x^n) is the density of its marginal distribution for X^m_{n+1}. To ensure that P(X^\u221e_{n+1}|x^n) is well-defined even if X is continuous, we impose the natural requirement that for any k \u2208 Z+ and any fixed event A_{k+1} \u2286 X_{k+1} the probability P(A_{k+1}|x^k) is a measurable function of x^k, which holds automatically if X is countable.\n\nModel Selection and Prediction The goal in model selection is to choose an explanation for observed data x^n from a potentially infinite list of candidate models M_1, M_2, . . . We consider parametric models, which are sets {p_\u03b8 : \u03b8 \u2208 \u0398} of prediction strategies p_\u03b8 that are indexed by elements of \u0398 \u2286 R^d, for some smallest possible d \u2208 N, the number of degrees of freedom. Examples of model selection are regression based on a set of basis functions such as polynomials (d is the number of coefficients of the polynomial), the variable selection problem in regression [15, 9, 20] (d is the number of variables), and histogram density estimation [13] (d is the number of bins). A model selection criterion is a function \u03b4 : X^\u2217 \u2192 Z+ that, given any data sequence x^n \u2208 X^\u2217, selects the model M_k with index k = \u03b4(x^n).\nWe associate each model M_k with a single prediction strategy p\u0304_k. The bar emphasizes that p\u0304_k is a meta-strategy based on the prediction strategies in M_k. In many approaches to model selection, for example AIC and LOO, p\u0304_k is defined using some estimator \u03b8\u0302_k for each model M_k, which maps a sequence x^n of previous observations to an estimated parameter value that represents a \u201cbest guess\u201d of the true/best distribution in the model. 
Prediction is then based on this estimator: p\u0304_k(X_{n+1} | x^n) = p_{\u03b8\u0302_k(x^n)}(X_{n+1} | x^n), which also defines a joint density p\u0304_k(x^n) = p\u0304_k(x_1) \u00b7 \u00b7 \u00b7 p\u0304_k(x_n|x^{n\u22121}).\nThe Bayesian approach to model selection or model averaging goes the other way around. We start out with a prior w on \u0398_k, and define the Bayesian marginal density\n\np\u0304_k(x^n) = \u222b_{\u03b8 \u2208 \u0398_k} p_\u03b8(x^n) w(\u03b8) d\u03b8. (1)\n\nWhen p\u0304_k(x^n) is non-zero this joint density induces a unique conditional density p\u0304_k(X_{n+1} | x^n) = p\u0304_k(X_{n+1}, x^n)/p\u0304_k(x^n), which is equal to the mixture of p_\u03b8 \u2208 M_k according to the posterior, w(\u03b8|x^n) = p_\u03b8(x^n)w(\u03b8)/\u222b p_\u03b8(x^n)w(\u03b8) d\u03b8, based on x^n. Thus the Bayesian approach also defines a prediction strategy p\u0304_k(X_{n+1}|x^n), whose corresponding distribution may be thought of as an estimator. From now on we sometimes call the distributions induced by p\u0304_1, p\u0304_2, . . . \u201cestimators\u201d, even if they are Bayesian. This unified view is known as prequential or predictive MDL [11, 5].\nExample 1. Suppose X = {0, 1}. Then a prediction strategy p\u0304 may be based on the Bernoulli model M = {p_\u03b8 | \u03b8 \u2208 [0, 1]} that regards X^\u221e as a sequence of independent, identically distributed Bernoulli random variables with P_\u03b8(X_{n+1} = 1) = \u03b8. We may predict X_{n+1} using the maximum likelihood (ML) estimator based on the past, i.e. using \u03b8\u0302(x^n) = n^{\u22121} \u2211_{i=1}^n x_i. The prediction for x_1 is then undefined. If we use a smoothed ML estimator such as the Laplace estimator, \u03b8\u0302\u2032(x^n) = (n + 2)^{\u22121}(\u2211_{i=1}^n x_i + 1), then all predictions are well-defined. Perhaps surprisingly, the predictor p\u0304\u2032 defined by p\u0304\u2032(X_{n+1} | x^n) = p_{\u03b8\u0302\u2032(x^n)}(X_{n+1}) equals the Bayesian predictive distribution based on a uniform prior. 
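The equality claimed in Example 1 can be checked with exact arithmetic. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and the Bayesian marginal is assumed to be computed from the standard Beta integral for a uniform prior on the Bernoulli parameter.

```python
from fractions import Fraction
from math import factorial


def bayes_marginal(xs):
    # Exact Bayesian marginal p(x^n) for the Bernoulli model under a uniform prior,
    # using the Beta integral: int_0^1 t^s (1 - t)^(n - s) dt = s! (n - s)! / (n + 1)!
    n, s = len(xs), sum(xs)
    return Fraction(factorial(s) * factorial(n - s), factorial(n + 1))


def bayes_predictive_one(xs):
    # Bayesian predictive probability that the next outcome is 1,
    # obtained as the ratio of joint marginals p(x^n, 1) / p(x^n).
    return bayes_marginal(list(xs) + [1]) / bayes_marginal(xs)


def laplace_estimate(xs):
    # Laplace (smoothed ML) estimate: (number of ones + 1) / (n + 2).
    return Fraction(sum(xs) + 1, len(xs) + 2)
```

For any binary sequence, including the empty one, the two functions return the same fraction, which is the coincidence the example describes.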
Thus in this case a Bayesian predictor and an estimation-based predictor coincide!\n\nThe Switch-Distribution Suppose p_1, p_2, . . . is a list of prediction strategies for X^\u221e. (Although here the list is infinitely long, the developments below can with little modification be adjusted to the case where the list is finite.) We first define a family Q = {q_s : s \u2208 S} of combinator prediction strategies that switch between the original prediction strategies. Here the parameter space S is defined as\n\nS = {(t_1, k_1), . . . , (t_m, k_m) \u2208 (N \u00d7 Z+)^m | m \u2208 Z+, 0 = t_1 < . . . < t_m}. (2)\n\nThe parameter s \u2208 S specifies the identities of m constituent prediction strategies and the sample sizes, called switch-points, at which to switch between them. For s = ((t\u2032_1, k\u2032_1), . . . , (t\u2032_{m\u2032}, k\u2032_{m\u2032})), we define t_i(s) = t\u2032_i, k_i(s) = k\u2032_i and m(s) = m\u2032. We omit the argument when the parameter s is clear from context, e.g. we write t_3 for t_3(s). For each s \u2208 S the corresponding q_s \u2208 Q is defined as:\n\nq_s(X_{n+1}|x^n) =\n  p_{k_1}(X_{n+1}|x^n)  if n < t_2,\n  p_{k_2}(X_{n+1}|x^n)  if t_2 \u2264 n < t_3,\n  ...\n  p_{k_{m\u22121}}(X_{n+1}|x^n)  if t_{m\u22121} \u2264 n < t_m,\n  p_{k_m}(X_{n+1}|x^n)  if t_m \u2264 n.\n(3)\n\nSwitching to the same predictor multiple times is allowed. The extra switch-point t_1 is included to simplify notation; we always take t_1 = 0. Now the switch-distribution is defined as a Bayesian mixture of the elements of Q according to a prior \u03c0 on S:\nDefinition 1 (Switch-Distribution). Let \u03c0 be a probability mass function on S. 
Then the switch-distribution P_sw with prior \u03c0 is the distribution for X^\u221e such that, for any n \u2208 Z+, the density of its marginal distribution for X^n is given by\n\np_sw(x^n) = \u2211_{s \u2208 S} q_s(x^n) \u00b7 \u03c0(s). (4)\n\nAlthough the switch-distribution provides a general way to combine prediction strategies, in this paper it will only be applied to combine prediction strategies p\u0304_1, p\u0304_2, . . . that correspond to models. In this case we may define a corresponding model selection criterion \u03b4_sw. To this end, let K_{n+1} : S \u2192 Z+ be a random variable that denotes the strategy/model that is used to predict X_{n+1} given past observations x^n. Formally, K_{n+1}(s) = k_i(s) iff t_i(s) \u2264 n and i = m(s) \u2228 n < t_{i+1}(s). Algorithm 1, given in Section 5, efficiently computes the posterior distribution on K_{n+1} given x^n:\n\n\u03c0(K_{n+1} = k | x^n) = [\u2211_{s : K_{n+1}(s) = k} \u03c0(s) q_s(x^n)] / p_sw(x^n), (5)\n\nwhich is defined whenever p_sw(x^n) is non-zero. We turn this into a model selection criterion \u03b4_sw(x^n) = arg max_k \u03c0(K_{n+1} = k|x^n) that selects the model with maximum posterior probability.\n\n3 Consistency\n\nIf one of the models, say with index k\u2217, is actually true, then it is natural to ask whether \u03b4_sw is consistent, in the sense that it asymptotically selects k\u2217 with probability 1. Theorem 1 below states that this is the case under certain conditions which are only slightly stronger than those required for the consistency of standard Bayes factor model selection.\nBayes factor model selection is consistent if for all k, k\u2032 \u2260 k, P\u0304_k(X^\u221e) and P\u0304_{k\u2032}(X^\u221e) are mutually singular, that is, if there exists a measurable set A \u2286 X^\u221e such that P\u0304_k(A) = 1 and P\u0304_{k\u2032}(A) = 0 [3]. For example, this can usually be shown to hold if the models are nested and for each k, \u0398_k is a subset of \u0398_{k+1} of w_{k+1}-measure 0 [6]. 
For consistency of \u03b4_sw, we need to strengthen this to the requirement that, for all k\u2032 \u2260 k and all x^n \u2208 X^\u2217, the distributions P\u0304_k(X^\u221e_{n+1} | x^n) and P\u0304_{k\u2032}(X^\u221e_{n+1} | x^n) are mutually singular. For example, if X_1, X_2, . . . are i.i.d. according to each P_\u03b8 in all models, but also if X is countable and p\u0304_k(x_{n+1} | x^n) > 0 for all k, all x^{n+1} \u2208 X^{n+1}, then this conditional mutual singularity is automatically implied by ordinary mutual singularity of P\u0304_k(X^\u221e) and P\u0304_{k\u2032}(X^\u221e).\n\nLet E_s = {s\u2032 \u2208 S | m(s\u2032) > m(s), (t_i(s\u2032), k_i(s\u2032)) = (t_i(s), k_i(s)) for i = 1, . . . , m(s)} denote the set of all possible extensions of s to more switch-points. Let p\u0304_1, p\u0304_2, . . . be Bayesian prediction strategies with respective parameter spaces \u0398_1, \u0398_2, . . . and priors w_1, w_2, . . ., and let \u03c0 be the prior of the corresponding switch-distribution.\nTheorem 1 (Consistency of the Switch-Distribution). Suppose \u03c0 is positive everywhere on {s \u2208 S | m(s) = 1} and is such that there exists a positive constant c such that, for every s \u2208 S, c \u00b7 \u03c0(s) \u2265 \u03c0(E_s). Suppose further that P\u0304_k(X^\u221e_{n+1} | x^n) and P\u0304_{k\u2032}(X^\u221e_{n+1} | x^n) are mutually singular for all k, k\u2032 \u2208 Z+, k \u2260 k\u2032, x^n \u2208 X^\u2217. Then, for all k\u2217 \u2208 Z+, for all \u03b8\u2217 \u2208 \u0398_{k\u2217} except for a subset of \u0398_{k\u2217} of w_{k\u2217}-measure 0, the posterior distribution on K_{n+1} satisfies\n\n\u03c0(K_{n+1} = k\u2217 | X^n) \u2192 1 as n \u2192 \u221e, with P_{\u03b8\u2217}-probability 1. (6)\n\nThe requirement that c \u00b7 \u03c0(s) \u2265 \u03c0(E_s) is automatically satisfied if \u03c0 is of the form:\n\n\u03c0(s) = \u03c0_M(m) \u03c0_K(k_1) \u220f_{i=2}^m \u03c0_T(t_i | t_i > t_{i\u22121}) \u03c0_K(k_i), (7)\n\nwhere \u03c0_M, \u03c0_K and \u03c0_T are priors on Z+ with full support, and \u03c0_M is geometric: \u03c0_M(m) = \u03b8^{m\u22121}(1 \u2212 \u03b8) for some 0 \u2264 \u03b8 < 1. In this case c = \u03b8/(1 \u2212 \u03b8).\n\n4 Optimal Risk Convergence Rates\n\nSuppose X_1, X_2, . . . are distributed according to P\u2217. We define the risk at sample size n \u2265 1 of the estimator P\u0304 relative to P\u2217 as\n\nR_n(P\u2217, P\u0304) = E_{X^{n\u22121} \u223c P\u2217}[D(P\u2217(X_n = \u00b7 | X^{n\u22121}) \u2016 P\u0304(X_n = \u00b7 | X^{n\u22121}))],\n\nwhere D(\u00b7\u2016\u00b7) is the Kullback-Leibler (KL) divergence [4]. This is the standard definition of risk relative to KL divergence. The risk is always well-defined, and equal to 0 if P\u0304(X_{n+1} | X^n) is equal to P\u2217(X_{n+1} | X^n). The following identity connects information-theoretic redundancy and accumulated statistical risk (see [4] or [6, Chapter 15]): If P\u2217 admits a density p\u2217, then for all prediction strategies p\u0304,\n\nE_{X^n \u223c P\u2217}[\u2212log p\u0304(X^n) + log p\u2217(X^n)] = \u2211_{i=1}^n R_i(P\u2217, P\u0304). (8)\n\nFor a union of parametric models M = \u222a_{k\u22651} M_k, we define the information closure \u27e8M\u27e9 = {P\u2217 | inf_{P \u2208 M} D(P\u2217\u2016P) = 0}, i.e. the set of distributions for X^\u221e that can be arbitrarily well approximated by elements of M. 
Theorem 2 below shows that, for a very large class of P\u2217 \u2208 \u27e8M\u27e9, the switch-distribution defined relative to estimators P\u0304_1, P\u0304_2, . . . achieves the same risk as any other model selection criterion defined with respect to the same estimators, up to lower order terms; in other words, model averaging based on the switch-distribution achieves at least the same rate of convergence as model selection based on any model selection criterion whatsoever (the issue of averaging vs selection will be discussed at length in the full paper). The theorem requires that the prior \u03c0 in (4) is of the form (7), and satisfies\n\n\u2212log \u03c0_M(m) = O(m) ; \u2212log \u03c0_K(k) = O(log k) ; \u2212log \u03c0_T(t) = O(log t). (9)\n\nThus, \u03c0_M, the prior on the total number of switch points, is allowed to decrease either polynomially or exponentially (as required for Theorem 1); \u03c0_T and \u03c0_K must decrease polynomially. For example, we could set \u03c0_T(t) = \u03c0_K(t) = 1/(t(t + 1)), or we could take the universal prior on the integers [12].\nLet M\u2217 \u2282 \u27e8M\u27e9 be some subset of interest of the information closure of model M. M\u2217 may consist of just a single, arbitrary distribution P\u2217 in \u27e8M\u27e9\\M \u2013 in that case Theorem 2 shows that the switch-distribution converges as fast as any other model selection criterion on any distribution in \u27e8M\u27e9 that cannot be expressed parametrically relative to M \u2013 or it may be a large, nonparametric family. In that case, Theorem 2 shows that the switch-distribution achieves the minimax convergence rate. 
For example, if the models M_k are k-bin histograms [13], then \u27e8M\u27e9 contains every distribution on [0, 1] with bounded continuous densities, and we may, for example, take M\u2217 to be the set of all distributions on [0, 1] which have a differentiable density p\u2217 such that p\u2217(x) and (d/dx)p\u2217(x) are bounded from below and above by some positive constants.\nWe restrict ourselves to model selection criteria which, at sample size n, never select a model M_k with k > n^\u03c4 for some arbitrarily large but fixed \u03c4 > 0; note that this condition will be met for most practical model selection criteria. Let h : Z+ \u2192 R+ denote the minimax optimal achievable risk as a function of the sample size, i.e.\n\nh(n) = inf_{\u03b4 : X^n \u2192 {1, 2, ..., \u2308n^\u03c4\u2309}} sup_{P\u2217 \u2208 M\u2217} sup_{n\u2032 \u2265 n} R_{n\u2032}(P\u2217, P\u0304_\u03b4), (10)\n\nwhere the infimum is over all model selection criteria restricted to sample size n, and \u2308\u00b7\u2309 denotes rounding up to the nearest integer. p\u0304_\u03b4 is the prediction strategy satisfying, for all n\u2032 \u2265 n, all x^{n\u2032} \u2208 X^{n\u2032}, p\u0304_\u03b4(X_{n\u2032+1} | x^{n\u2032}) := p\u0304_{\u03b4(x^n)}(X_{n\u2032+1} | x^{n\u2032}), i.e. at sample size n it predicts x_{n+1} using p\u0304_k for the k = \u03b4(X^n) chosen by \u03b4, and it keeps predicting future x_{n\u2032+1} by this k. We call h(n) the minimax optimal rate of convergence for model selection relative to data from M\u2217, model list M_1, M_2, . . ., and estimators P\u0304_1, P\u0304_2, . . . The definition is slightly nonstandard, in that we require a second supremum over n\u2032 \u2265 n. This is needed because, as will be discussed in the full paper, it can sometimes happen that, for some P\u2217, some k, some n\u2032 > n, R_{n\u2032}(P\u2217, P\u0304_k) > R_n(P\u2217, P\u0304_k) (see also [4, Section 7.1]). 
In cases where this cannot happen, such as regression with standard ML estimators, and in cases where, uniformly for all k, sup_{n\u2032\u2265n} R_{n\u2032}(P\u2217, P\u0304_k) \u2212 R_n(P\u2217, P\u0304_k) = o(\u2211_{i=1}^n h(i)) (in the full paper we show that this holds for, for example, histogram density estimation), our Theorem 2 also implies minimax convergence in terms of the standard definition, without the sup_{n\u2032\u2265n}. We expect that the sup_{n\u2032\u2265n} can be safely ignored for most \u201creasonable\u201d models and estimators.\nTheorem 2. Define P_sw for some model class M = \u222a_{k\u22651} M_k as in (4), where the prior \u03c0 satisfies (9). Let M\u2217 be a subset of \u27e8M\u27e9 with minimax rate h such that nh(n) is increasing, and nh(n)/(log n)^2 \u2192 \u221e. Then\n\nlim sup_{n\u2192\u221e} [sup_{P\u2217 \u2208 M\u2217} \u2211_{i=1}^n R_i(P\u2217, P_sw)] / [\u2211_{i=1}^n h(i)] \u2264 1. (11)\n\nThe requirement that nh(n)/(log n)^2 \u2192 \u221e will typically be satisfied whenever M\u2217 \\ M is nonempty. Then M\u2217 contains P\u2217 that are \u201cnonparametric\u201d relative to the chosen sequence of models M_1, M_2, . . . Thus, the problem should not be \u201ctoo simple\u201d: we do not know whether (11) holds in the parametric setting where P\u2217 \u2208 M_k for some k on the list. Theorem 2 expresses that the accumulated risk of the switch-distribution, as n increases, is not significantly larger than the accumulated risk of any other procedure. This \u201cconvergence in sum\u201d has been considered before by, for example, [13, 4], and is compared to ordinary convergence in the full paper, where we will also give example applications of the theorem and further discuss (10). The proof works by bounding the redundancy of the switch-distribution, which, by (8), is identical to the accumulated risk. 
It is not clear whether similar techniques can be used to bound the individual risk.\n\n5 Computing the Switch-Distribution\n\nAlgorithm 1 sequentially computes the posterior probability on predictors p_1, p_2, . . .. It requires that \u03c0 is a prior of the form in (7), and \u03c0_M is geometric, as is also required for Theorem 1 and permitted in Theorem 2. The algorithm resembles FIXED-SHARE [7], but whereas FIXED-SHARE implicitly imposes a geometric distribution for \u03c0_T, we allow general priors by varying the shared weight with n. We do require slightly more space to cope with \u03c0_M.\n\nAlgorithm 1 SWITCH(x^N)\n  \u22b2 K is the number of experts; \u03b8 is as in the definition of \u03c0_M.\n  for k = 1, . . . , K do initialise w^a_k \u2190 \u03b8 \u00b7 \u03c0_K(k); w^b_k \u2190 (1 \u2212 \u03b8) \u00b7 \u03c0_K(k) od\n  Report prior \u03c0(K_1) = w^a_{K_1} (a K-sized array)\n  for n = 1, . . . , N do\n    for k = 1, . . . , K do w^a_k \u2190 w^a_k \u00b7 p_k(x_n|x^{n\u22121}); w^b_k \u2190 w^b_k \u00b7 p_k(x_n|x^{n\u22121}) od (loss update)\n    pool \u2190 \u03c0_T(Z = n | Z \u2265 n) \u00b7 \u2211_k w^a_k (share update)\n    for k = 1, . . . , K do\n      w^a_k \u2190 w^a_k \u00b7 \u03c0_T(Z \u2260 n | Z \u2265 n) + \u03b8 \u00b7 pool \u00b7 \u03c0_K(k)\n      w^b_k \u2190 w^b_k + (1 \u2212 \u03b8) \u00b7 pool \u00b7 \u03c0_K(k)\n    od\n    Report posterior \u03c0(K_{n+1} | x^n) = (w^a_{K_{n+1}} + w^b_{K_{n+1}}) / \u2211_k (w^a_k + w^b_k) (a K-sized array)\n  od\n\nThis algorithm can be used to obtain fast convergence in the sense of Theorem 2, which can be extended to cope with a restriction to only the first K experts. Theorem 1 can be extended to show consistency in this case as well. If \u03c0_T(Z = n | Z \u2265 n) and \u03c0_K(k) can be computed in constant time, then the running time is \u0398(N \u00b7 K), which is of the same order as that of fast model selection criteria like AIC and BIC. 
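The updates of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name and arguments are hypothetical, the predictors are assumed to be supplied as a table of conditional probabilities, the default prior on experts is assumed uniform over the K experts, and the default hazard 1/(n + 1) is what the example prior 1/(t(t + 1)) from Section 4 gives for the conditional probability of a switch at n. The sketch renormalises the weights each round to avoid underflow and reports the sum of both weight arrays at every step, including for the prior.

```python
import math


def switch_posteriors(cond_probs, theta=0.5, prior_k=None, hazard=None):
    # cond_probs[n - 1][k] is assumed to hold p_k(x_n | x^{n-1}) for expert k.
    num_experts = len(cond_probs[0])
    if prior_k is None:
        # Assumption: uniform prior over the K experts.
        prior_k = [1.0 / num_experts] * num_experts
    if hazard is None:
        # Assumption: pi_T(t) = 1 / (t (t + 1)), so pi_T(Z = n | Z >= n) = 1 / (n + 1).
        hazard = lambda n: 1.0 / (n + 1)

    # wa: weight of sequences that may still switch; wb: weight of sequences that will not.
    wa = [theta * pk for pk in prior_k]
    wb = [(1.0 - theta) * pk for pk in prior_k]
    posteriors = [[a + b for a, b in zip(wa, wb)]]
    codelength_bits = 0.0

    for n, probs in enumerate(cond_probs, start=1):
        # Loss update: multiply every weight by the probability its expert gave to x_n.
        wa = [w * p for w, p in zip(wa, probs)]
        wb = [w * p for w, p in zip(wb, probs)]

        # Share update: a fraction of the switchable weight is pooled and redistributed.
        h = hazard(n)
        pool = h * sum(wa)
        wa = [w * (1.0 - h) + theta * pool * pk for w, pk in zip(wa, prior_k)]
        wb = [w + (1.0 - theta) * pool * pk for w, pk in zip(wb, prior_k)]

        # The share update preserves total weight, so after renormalising each round the
        # total here is the probability this mixture assigned to x_n given the past.
        total = sum(wa) + sum(wb)
        codelength_bits -= math.log2(total)
        wa = [w / total for w in wa]
        wb = [w / total for w in wb]
        posteriors.append([a + b for a, b in zip(wa, wb)])

    return posteriors, codelength_bits
```

Each round touches every expert a constant number of times, which matches the stated running time of order N times K. For instance, with two experts where the first assigns probability 0.9 to each of the first 10 outcomes and the second does so for the remaining 40, the reported posterior favours the first expert after 10 outcomes and the second at the end.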
We will explain this algorithm in more detail in a forthcoming publication.\n\nAcknowledgements We thank Y. Mansour, whose remark over lunch at COLT 2005 sparked off all this research. We thank P. Harremo\u00ebs and W. Koolen for mathematical support. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors\u2019 views.\n\nA Proofs\n\nProof of Theorem 1. Let U_n = {s \u2208 S | K_{n+1}(s) \u2260 k\u2217} denote the set of \u2018bad\u2019 parameters s that select an incorrect model. It is sufficient to show that\n\nlim_{n\u2192\u221e} [\u2211_{s \u2208 U_n} \u03c0(s) q_s(X^n)] / [\u2211_{s \u2208 S} \u03c0(s) q_s(X^n)] = 0 with P\u0304_{k\u2217}-probability 1. (12)\n\nTo see this, suppose the theorem is false. Then there exists a \u03a6 \u2286 \u0398_{k\u2217} with w_{k\u2217}(\u03a6) > 0 such that (6) does not hold for any \u03b8\u2217 \u2208 \u03a6. But then by definition of P\u0304_{k\u2217} we have a contradiction with (12).\nNow let A = {s \u2208 S : k_m(s) \u2260 k\u2217} denote the set of parameters that are bad for sufficiently large n. We observe that for each s\u2032 \u2208 U_n there exists at least one element s \u2208 A that uses the same sequence of switch-points and predictors on the first n + 1 outcomes (this implies that K_i(s) = K_i(s\u2032) for i = 1, . . . , n + 1) and has no switch-points beyond n (i.e. t_m(s) \u2264 n). Consequently, either s\u2032 = s or s\u2032 \u2208 E_s. 
Therefore\n\n\u2211_{s\u2032 \u2208 U_n} \u03c0(s\u2032) q_{s\u2032}(x^n) \u2264 \u2211_{s \u2208 A} (\u03c0(s) + \u03c0(E_s)) q_s(x^n) \u2264 (1 + c) \u2211_{s \u2208 A} \u03c0(s) q_s(x^n). (13)\n\nDefining the mixture r(x^n) = \u2211_{s \u2208 A} \u03c0(s) q_s(x^n), we will show that\n\nlim_{n\u2192\u221e} r(X^n) / (\u03c0(s = (0, k\u2217)) \u00b7 p\u0304_{k\u2217}(X^n)) = 0 with P\u0304_{k\u2217}-probability 1. (14)\n\nUsing (13) and the fact that \u2211_{s \u2208 S} \u03c0(s) q_s(x^n) \u2265 \u03c0(s = (0, k\u2217)) \u00b7 p\u0304_{k\u2217}(x^n), this implies (12). For all s \u2208 A and x^{t_m(s)} \u2208 X^{t_m(s)}, by definition Q_s(X^\u221e_{t_m+1} | x^{t_m}) equals P\u0304_{k_m}(X^\u221e_{t_m+1} | x^{t_m}), which is mutually singular with P\u0304_{k\u2217}(X^\u221e_{t_m+1} | x^{t_m}) by assumption. If X is a separable metric space, which holds because X \u2286 R^d for some d \u2208 Z+, it can be shown that this conditional mutual singularity implies mutual singularity of Q_s(X^\u221e) and P\u0304_{k\u2217}(X^\u221e). To see this for countable X, let B_{x^{t_m}} be any event such that Q_s(B_{x^{t_m}} | x^{t_m}) = 1 and P\u0304_{k\u2217}(B_{x^{t_m}} | x^{t_m}) = 0. Then, for B = {y^\u221e \u2208 X^\u221e | y^\u221e_{t_m+1} \u2208 B_{y^{t_m}}}, we have that Q_s(B) = 1 and P\u0304_{k\u2217}(B) = 0. In the uncountable case, however, B may not be measurable. We omit the full proof, which was shown to us by P. Harremo\u00ebs. Any countable mixture of distributions that are mutually singular with P_{k\u2217}, in particular R, is mutually singular with P_{k\u2217}. This implies (14) by Lemma 3.1 of [2], which says that for any two mutually singular distributions R and P, the density ratio r(X^n)/p(X^n) goes to 0 as n \u2192 \u221e with P-probability 1.\n\nProof of Theorem 2. We will show that for every \u03b1 > 1,\n\nsup_{P\u2217 \u2208 M\u2217} \u2211_{i=1}^n R_i(P\u2217, P_sw) \u2264 \u03b1 \u2211_{i=1}^n h(i) + \u03b5_{\u03b1,n} \u2211_{i=1}^n h(i), (15)\n\nwhere \u03b5_{\u03b1,n} \u2192 0 as n \u2192 \u221e, and \u03b5_{\u03b1,1}, \u03b5_{\u03b1,2}, . . . are fixed constants that only depend on \u03b1, but not on the chosen subset M\u2217 of \u27e8M\u27e9. Theorem 2 is a consequence of (15), which we will proceed to prove.\nLet \u03b4_n : X^n \u2192 {1, . . . , \u2308n^\u03c4\u2309} be a model selection criterion, restricted to samples of size n, that is minimax optimal, i.e. it achieves the infimum in (10). If such a \u03b4_n does not exist, we take a \u03b4_n that is almost minimax optimal in the sense that it achieves the infimum to within h(n)/n. For j \u2265 1, let t_j = \u2308\u03b1^{j\u22121}\u2309 \u2212 1. Fix an arbitrary n > 0 and let m be the unique integer such that t_m < n \u2264 t_{m+1}. We will first show that for arbitrary x^n, p_sw achieves redundancy not much worse than q_s with s = (t_1, k_1), . . . , (t_m, k_m), where k_i = \u03b4_{t_i}(x^{t_i}). Then we show that the redundancy of this q_s is small enough for (15) to hold. Thus, to achieve this redundancy, it is sufficient to take only a logarithmic number m \u2212 1 of switch-points: m \u2212 1 < log_\u03b1(n + 1). Formally, we have, for some c > 0, uniformly for all n, x^n \u2208 X^n,\n\n\u2212log p_sw(x^n) = \u2212log \u2211_{s\u2032 \u2208 S} q_{s\u2032}(x^n) \u03c0(s\u2032) \u2264 \u2212log q_s(x^n) \u2212 log \u03c0_M(m) \u2212 \u2211_{j=1}^m log \u03c0_T(t_j) \u03c0_K(k_j)\n\u2264 \u2212log q_s(x^n) + c log(n + 1) + cm(\u03c4 + 1) log n = \u2212log q_s(x^n) + O((log n)^2). (16)\n\nHere the second inequality follows because of (9), and the final equality follows because m \u2264 log_\u03b1(n + 1) + 1. Now fix any P\u2217 \u2208 \u27e8M\u27e9. Since P\u2217 \u2208 \u27e8M\u27e9, it must have some density p\u2217. 
Thus, applying (8), and then (16), and then (8) again, we find that\n\n$$\\begin{aligned} \\sum_{i=1}^{n} R_i(P^*, P_{\\mathrm{sw}}) &= E_{X^n \\sim P^*}\\bigl[-\\log p_{\\mathrm{sw}}(X^n) + \\log p^*(X^n)\\bigr] \\\\ &\\leq E_{X^n \\sim P^*}\\bigl[-\\log q_s(X^n) + \\log p^*(X^n)\\bigr] + O((\\log n)^2) \\\\ &= \\sum_{i=1}^{n} R_i(P^*, Q_s) + O((\\log n)^2) = \\sum_{j=1}^{m} \\sum_{i=t_j+1}^{\\min\\{t_{j+1},\\, n\\}} R_i(P^*, \\bar{P}_{k_j}) + O((\\log n)^2). \\end{aligned} \\tag{17}$$\n\nFor $i$ appearing in the second sum, with $t_j < i \\leq t_{j+1}$, we have $R_i(P^*, \\bar{P}_{k_j}) \\leq \\sup_{i' \\geq t_j + 1} R_{i'}(P^*, \\bar{P}_{k_j}) = \\sup_{i' \\geq t_j + 1} R_{i'}(P^*, \\bar{P}_{\\delta_{t_j}(x^{t_j})}) \\leq h(t_j + 1)$, so that\n\n$$R_i(P^*, \\bar{P}_{k_j}) \\leq \\frac{1}{t_j + 1} \\cdot (t_j + 1)\\, h(t_j + 1) \\leq \\frac{1}{t_j + 1} \\cdot i\\, h(i) \\leq \\frac{t_{j+1}}{t_j + 1}\\, h(i) \\leq \\alpha\\, h(i),$$\n\nwhere the middle inequality follows because $n h(n)$ is increasing (condition (b) of the theorem). Summing over $i$, we get $\\sum_{j=1}^{m} \\sum_{i=t_j+1}^{\\min\\{t_{j+1},\\, n\\}} R_i(P^*, \\bar{P}_{k_j}) \\leq \\alpha \\sum_{i=1}^{n} h(i)$. Combining this with (17), it follows that $\\sum_{i=1}^{n} R_i(P^*, P_{\\mathrm{sw}}) \\leq \\alpha \\sum_{i=1}^{n} h(i) + O((\\log n)^2)$. Because this holds for arbitrary $P^* \\in \\mathcal{M}^*$ (with the constant in the O notation not depending on $P^*$), (15) now follows by the requirement of Theorem 2 that $n h(n)/(\\log n)^2 \\to \\infty$.\n\nReferences\n[1] H. Akaike. A new look at statistical model identification. IEEE T. Automat. Contr., 19(6):716\u2013723, 1974.\n[2] A. Barron. Logically Smooth Density Estimation. PhD thesis, Stanford University, Stanford, CA, 1985.\n[3] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE T. Inform. Theory, 44(6):2743\u20132760, 1998.\n[4] A. R. Barron. Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6, pages 27\u201352, 1998.\n[5] A. P. Dawid. 
Statistical theory: The prequential approach. J. Roy. Stat. Soc. A, 147, Part 2:278\u2013292, 1984.\n[6] P. D. Gr\u00fcnwald. The Minimum Description Length Principle. The MIT Press, 2007.\n[7] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32:151\u2013178, 1998.\n[8] R. E. Kass and A. E. Raftery. Bayes factors. J. Am. Stat. Assoc., 90(430):773\u2013795, 1995.\n[9] K. Li. Asymptotic optimality of $C_p$, $C_L$, cross-validation and generalized cross-validation: Discrete index set. Ann. Stat., 15:958\u2013975, 1987.\n[10] C. Monteleoni and T. Jaakkola. Online learning of non-stationary sequences. In Advances in Neural Information Processing Systems, volume 16, Cambridge, MA, 2004. MIT Press.\n[11] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE T. Inform. Theory, IT-30(4):629\u2013636, 1984.\n[12] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.\n[13] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE T. Inform. Theory, 38(2):315\u2013323, 1992.\n[14] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6(2):461\u2013464, 1978.\n[15] R. Shibata. Asymptotic mean efficiency of a selection of regression variables. Ann. I. Stat. Math., 35:415\u2013423, 1983.\n[16] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike\u2019s criterion. J. Roy. Stat. Soc. B, 39:44\u201347, 1977.\n[17] P. Volf and F. Willems. Switching between two universal source coding algorithms. In Proceedings of the Data Compression Conference, Snowbird, Utah, pages 491\u2013500, 1998.\n[18] V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35:247\u2013282, 1999.\n[19] Y. Yang. Can the strengths of AIC and BIC be shared? Biometrika, 92(4):937\u2013950, 2005.\n[20] Y. Yang. Model selection for nonparametric regression. 
Statistica Sinica, 9:475\u2013499, 1999.\n", "award": [], "sourceid": 756, "authors": [{"given_name": "Tim", "family_name": "Erven", "institution": null}, {"given_name": "Steven", "family_name": "Rooij", "institution": null}, {"given_name": "Peter", "family_name": "Gr\u00fcnwald", "institution": null}]}