{"title": "Discrete MDL Predicts in Total Variation", "book": "Advances in Neural Information Processing Systems", "page_first": 817, "page_last": 825, "abstract": "The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed.", "full_text": "Discrete MDL Predicts in Total Variation\n\nMarcus Hutter\n\nRSISE @ ANU and SML @ NICTA\n\nCanberra, ACT, 0200, Australia\n\nmarcus@hutter1.net\n\nwww.hutter1.net\n\nAbstract\n\nThe Minimum Description Length (MDL) principle selects the model that has the\nshortest code for data plus model. We show that for a countable class of models,\nMDL predictions are close to the true distribution in a strong sense. The result\nis completely general. No independence, ergodicity, stationarity, identifiability,\nor other assumption on the model class needs to be made. More formally, we show\nthat for any countable class of models, the distributions selected by MDL (or MAP)\nasymptotically predict (merge with) the true measure in the class in total variation\ndistance. Implications for non-i.i.d. domains like time-series forecasting,\ndiscriminative learning, and reinforcement learning are discussed.\n\n1 Introduction\n\nThe minimum description length (MDL) principle recommends using, among competing models,\nthe one that compresses the data+model most [Gr\u00fc07]. 
The better the compression, the\nmore regularity has been detected, hence the better the predictions will be. The MDL principle can be\nregarded as a formalization of Ockham\u2019s razor, which says to select the simplest model consistent\nwith the data.\nMultistep lookahead sequential prediction. We consider sequential prediction problems, i.e. having\nobserved sequence x \u2261 (x_1,x_2,...,x_\u2113) \u2261 x_{1:\u2113}, predict z \u2261 (x_{\u2113+1},...,x_{\u2113+h}) \u2261 x_{\u2113+1:\u2113+h}, then observe\nx_{\u2113+1} \u2208 X for \u2113 \u2261 \u2113(x) = 0,1,2,.... Classical prediction is concerned with h = 1, multi-step lookahead\nwith 1 < h < \u221e, and total prediction with h = \u221e. In this paper we consider the last, hardest\ncase. An infamous problem in this category is the Black raven paradox [Mah04, Hut07]: Having\nobserved \u2113 black ravens, what is the likelihood that all ravens are black? A problem closer to computer\nscience is (infinite horizon) reinforcement learning, where predicting the infinite future is necessary\nfor evaluating a policy. See Section 6 for these and other applications.\nDiscrete MDL. Let M = {Q_1,Q_2,...} be a countable class of models=theories=hypotheses=\nprobabilities over sequences X^\u221e, sorted w.r.t. their complexity=codelength K(Q_i) = 2 log_2 i (say),\ncontaining the unknown true sampling distribution P. Our main result will be for arbitrary measurable\nspaces X, but to keep things simple in the introduction, let us illustrate MDL for finite X.\nIn this case, we define Q_i(x) as the Q_i-probability of data sequence x \u2208 X^\u2113. It is possible to code x\nin log P(x)^{\u22121} bits, e.g. by using Huffman coding. Since x is sampled from P, this code is optimal\n(shortest among all prefix codes). Since we do not know P, we could select the Q \u2208 M that leads to\nthe shortest code on the observed data x. 
In order to be able to reconstruct x from the code we need\nto know which Q has been chosen, so we also need to code Q, which takes K(Q) bits. Hence x can\nbe coded in min_{Q\u2208M}{\u2212log Q(x)+K(Q)} bits. MDL selects as model the minimizer\n\nMDL^x := arg min_{Q\u2208M} {\u2212log Q(x) + K(Q)}\n\nMain result. Given x, the true predictive probability of some \u201cfuture\u201d event A is P[A|x], e.g. A\ncould be x_{\u2113+1:\u2113+h} or any other measurable set of sequences (see Section 3 for proper definitions).\nWe consider the sequence of predictive measures MDL^x[\u00b7|x] for \u2113 = 0,1,2,... selected by MDL. Our\nmain result is that\n\nMDL^x[\u00b7|x] converges to P[\u00b7|x] in total variation distance for \u2113 \u2192 \u221e with P-probability 1\n\n(see Theorem 1). The analogous result for Bayesian prediction is well-known, and an immediate\ncorollary of Blackwell and Dubins\u2019 celebrated merging-of-opinions theorem [BD62]. Our primary\ncontribution is to prove the analogous result for MDL. A priori it is not obvious that it holds at all, and\nindeed the proof turns out to be much more complex.\nMotivation. The results above hold for completely arbitrary countable model classes M. No\nindependence, ergodicity, stationarity, identifiability, or other assumption needs to be made.\nThe bulk of previous results for MDL are for continuous model classes [Gr\u00fc07]. Much has been\nshown for classes of independent identically distributed (i.i.d.) random variables [BC91, Gr\u00fc07].\nMany results naturally generalize to stationary-ergodic sequences like (kth-order) Markov. For\ninstance, asymptotic consistency has been shown in [Bar85]. There are many applications violating\nthese assumptions, some of which are presented below and in Section 6. For MDL to work, P needs\nto be in M or at least close to some Q \u2208 M, and there are interesting environments that are not even\nclose to being stationary-ergodic or i.i.d.\nNon-i.i.d. 
data is pervasive [AHRU09]; it includes all time-series prediction problems like weather\nforecasting and stock market prediction [CBL06]. Indeed, these are also perfect examples of\nnon-ergodic processes. Too many greenhouse gases, a massive volcanic eruption, an asteroid impact,\nor another world war could change the climate/economy irreversibly. Life is also not ergodic; one\ninattentive second in a car can have irreversible consequences. Also stationarity is easily violated\nin multi-agent scenarios: An environment which itself contains a learning agent is non-stationary\n(during the relevant learning phase). Extensive games and multi-agent reinforcement learning are\nclassical examples [WR04].\nOften it is assumed that the true distribution can be uniquely identified asymptotically. For\nnon-ergodic environments, asymptotic distinguishability can depend on the realized observations, which\nprevents a prior reduction or partitioning of M. Even if possible in principle, it can be practically\nburdensome to do so, e.g. in the presence of approximate symmetries. Indeed this problem is the\nprimary reason for considering predictive MDL. MDL might never identify the true distribution, but\nour main result shows that the sequentially selected models become predictively indistinguishable.\nFor arbitrary countable model classes, the following results are known: The MDL one-step lookahead\npredictor (i.e. h = 1) of three variants of MDL converges to the true predictive distribution. The\nproof technique used in [PH05] is inherently limited to finite h. Another general consistency result\nis presented in [Gr\u00fc07, Thm.5.1]. Consistency is shown (only) in probability and the predictive\nimplications of the result are unclear. A stronger almost sure result is alluded to, but the given\nreference to [BC91] contains only results for i.i.d. sequences which do not generalize to arbitrary\nclasses. 
So existing results for discrete MDL are far less satisfactory than the elegant Bayesian\nmerging-of-opinions result.\nThe countability of M is the severest restriction of our result. Nevertheless the countable case\nis useful. A semi-parametric problem class \u22c3_{d=1}^\u221e M_d with M_d = {Q_{\u03b8,d} : \u03b8 \u2208 IR^d} (say) can be\nreduced to a countable class M = {P_d} for which our result holds, where P_d is a Bayes or NML\nor other estimate of M_d [Gr\u00fc07]. Alternatively, \u22c3_d M_d could be reduced to a countable class by\nconsidering only computable parameters \u03b8. Essentially all interesting model classes contain such\na countable topologically dense subset. Under certain circumstances MDL still works for the\nnon-computable parameters [Gr\u00fc07]. Alternatively one may simply reject non-computable parameters\non philosophical grounds [Hut05]. Finally, the techniques for the countable case might aid proving\ngeneral results for continuous M, possibly along the lines of [Rya09].\nContents. The paper is organized as follows: In Section 2 we provide some insights into how MDL\nworks in restricted settings, what breaks down for general countable M, and how to circumvent the\nproblems. The formal development starts with Section 3, which introduces notation and our main\nresult. The proof for finite M is presented in Section 4 and for denumerable M in Section 5. In\nSection 6 we show how the result can be applied to sequence prediction, classification and regression,\ndiscriminative learning, and reinforcement learning. Section 7 discusses some MDL variations.\n\n2 Facts, Insights, Problems\n\nBefore starting with the formal development, we describe how MDL works in some restricted\nsettings, what breaks down for general countable M, and how to circumvent the problems. For\ndeterministic environments, MDL reduces to learning by elimination, and results can easily be understood.\nConsistency of MDL for i.i.d. (and stationary-ergodic) sources is also intelligible. For general M,\nMDL may no longer converge to the true model. We have to give up the idea of model identification,\nand concentrate on predictive performance.\nDeterministic MDL = elimination learning. For a countable class M = {Q_1,Q_2,...} of\ndeterministic theories=models=hypotheses=sequences, sorted w.r.t. their complexity=codelength\nK(Q_i) = 2 log_2 i (say) it is easy to see why MDL works: Each Q is a model for one infinite\nsequence x^Q_{1:\u221e}, i.e. Q(x^Q) = 1. Given the true observations x \u2261 x^P_{1:\u2113} so far, MDL selects the simplest\nQ consistent with x^P_{1:\u2113} and for h = 1 predicts x^Q_{\u2113+1}. This (and potentially other) Q becomes (forever)\ninconsistent if and only if the prediction was wrong. Assume the true model is P = Q_m. Since\nelimination occurs in order of increasing index i, and Q_m never makes any error, MDL makes at\nmost m\u22121 prediction errors. Indeed, what we have described is just classical Gold style learning\nby elimination. For 1 < h < \u221e, the prediction x^Q_{\u2113+1:\u2113+h} may be wrong only on x^Q_{\u2113+h}, which causes\nh wrong predictions before the error is revealed. (Note that at time \u2113 only x^P_\u2113 is revealed.) Hence\nthe total number of errors is bounded by h\u00b7(m\u22121). The bound is for instance attained on the class\nconsisting of Q_i = 1^{ih}0^\u221e, and the true sequence switches from 1 to 0 after having observed m\u00b7h\nones. For h = \u221e, a wrong prediction gets eventually revealed. Hence each wrong Q_i (i < m) gets\neventually eliminated, i.e. P gets eventually selected. So for h = \u221e we can (still/only) show that the\nnumber of errors is finite. No bound on the number of errors in terms of m only is possible. 
For\ninstance, for M = {Q_1 = 1^\u221e, Q_2 = P = 1^n 0^\u221e}, it takes n time steps to reveal that prediction 1^\u221e is\nwrong, and n can be chosen arbitrarily large.\nComparison of deterministic\u2194probabilistic and MDL\u2194Bayes. The flavor of results carries over\nto some extent to the probabilistic case. On a very abstract level even the line of reasoning carries\nover, although this is deeply buried in the sophisticated mathematical analysis of the latter. So the\nspecial deterministic case illustrates the more complex probabilistic case. The differences are as\nfollows: In the probabilistic case, the true P can in general not be identified anymore. Further, while\nthe Bayesian bound trivially follows from the 1/2-century old classical merging of opinions result\n[BD62], the corresponding MDL bound we prove in this paper is more difficult to obtain.\nConsistency of MDL for stationary-ergodic sources. For an i.i.d. class M, the law of large numbers\napplied to the random variables Z_t := log[P(x_t)/Q(x_t)] implies (1/\u2113) \u2211_{t=1}^\u2113 Z_t \u2192 KL(P||Q) :=\n\u2211_{x_1} P(x_1) log[P(x_1)/Q(x_1)] with P-probability 1. Either the Kullback-Leibler (KL) divergence is\nzero, which is the case if and only if P = Q, or log P(x_{1:\u2113}) \u2212 log Q(x_{1:\u2113}) \u2261 \u2211_{t=1}^\u2113 Z_t \u223c KL(P||Q)\u00b7\u2113 \u2192\n\u221e, i.e. asymptotically MDL does not select Q. For countable M, a refinement of this argument\nshows that MDL eventually selects P [BC91]. This reasoning can be extended to stationary-ergodic\nM, but essentially not beyond. To see where the limitation comes from, we present some troubling\nexamples.\nTrouble makers. For instance, let P be a Bernoulli(\u03b8_0) process, but let the Q-probability that\nx_t = 1 be \u03b8_t, i.e. time-dependent (still assuming independence). For a suitably converging but\n\u201coscillating\u201d (i.e. 
infinitely often larger and smaller than its limit) sequence \u03b8_t \u2192 \u03b8_0 one can show that\nlog[P(x_{1:t})/Q(x_{1:t})] converges to but oscillates around K(Q) \u2212 K(P) w.p.1, i.e. there are\nnon-stationary distributions for which MDL does not converge (not even to a wrong distribution).\nOne idea to solve this problem is to partition M, where two distributions are in the same partition\nif and only if they are asymptotically indistinguishable (like P and Q above), and then ask MDL\nto only identify a partition. This approach cannot succeed generally, whatever particular criterion is\nused, for the following reason: Let P(x_1) > 0 \u2200x_1. For x_1 = 1, let P and Q be asymptotically\nindistinguishable, e.g. P = Q on the remainder of the sequence. For x_1 = 0, let P and Q be asymptotically\ndistinguishable distributions, e.g. different Bernoullis. This shows that for non-ergodic sources like\nthis one, asymptotic distinguishability depends on the drawn sequence. The first observation can lead\nto totally different futures.\nPredictive MDL avoids trouble. The Bayesian posterior does not need to converge to a single (true\nor other) distribution, in order for prediction to work. We can do something similar for MDL. At\neach time we still select a single distribution, but give up the idea of identifying a single distribution\nasymptotically. We just measure predictive success, and accept infinite oscillations. That\u2019s the\napproach taken in this paper.\n\n3 Notation and Main Result\n\nThe formal development starts with this section. We need probability measures and filters for infinite\nsequences, conditional probabilities and densities, the total variation distance, and the concept of\nmerging (of opinions), in order to formally state our main result.\nMeasures on sequences. 
Let (\u2126 := X^\u221e, F, P) be the space of infinite sequences with natural filtration\nand product \u03c3-field F and probability measure P. Let \u03c9 \u2208 \u2126 be an infinite sequence sampled\nfrom the true measure P. Except when mentioned otherwise, all probability statements and\nexpectations refer to P, e.g. almost surely (a.s.) and with probability 1 (w.p.1) are short for with\nP-probability 1 (w.P.p.1). Let x = x_{1:\u2113} = \u03c9_{1:\u2113} be the first \u2113 symbols of \u03c9.\nFor countable X, the probability that an infinite sequence starts with x is P(x) := P[{x}\u00d7X^\u221e]. The\nconditional distribution of an event A given x is P[A|x] := P[A\u2229({x}\u00d7X^\u221e)]/P(x), which exists\nw.p.1. For other probability measures Q on \u2126, we define Q(x) and Q[A|x] analogously. General X\nare considered at the end of this section.\nConvergence in total variation. P is said to be absolutely continuous relative to Q, written\n\nP \u226a Q :\u21d4 [Q[A] = 0 implies P[A] = 0 for all A \u2208 F]\n\nP and Q are said to be mutually singular, written P\u22a5Q, iff there exists an A \u2208 F for which P[A] = 1\nand Q[A] = 0. The total variation distance (tvd) between Q and P given x is defined as\n\nd(P,Q|x) := sup_{A\u2208F} |Q[A|x] \u2212 P[A|x]|    (1)\n\nQ is said to predict P in tvd (or merge with P) if d(P,Q|x) \u2192 0 for \u2113(x) \u2192 \u221e with P-probability\n1. Note that this in particular implies, but is stronger than one-step predictive on- and off-sequence\nconvergence Q(x_{\u2113+1} = a_{\u2113+1}|x_{1:\u2113}) \u2212 P(x_{\u2113+1} = a_{\u2113+1}|x_{1:\u2113}) \u2192 0 for any a, not necessarily equal to \u03c9\n[KL94]. The famous Blackwell and Dubins convergence result [BD62] states that if P is absolutely\ncontinuous relative to Q, then (and only then [KL94]) Q merges with P:\n\nIf P \u226a Q then d(P,Q|x) \u2192 0 w.p.1 for \u2113(x) \u2192 \u221e\n\nBayesian prediction. This result can immediately be utilized for Bayesian prediction. Let M :=\n{Q_1,Q_2,Q_3,...} be a countable (finite or infinite) class of probability measures, and Bayes[A] :=\n\u2211_{Q\u2208M} Q[A] w_Q with w_Q > 0 \u2200Q and \u2211_{Q\u2208M} w_Q = 1. If the model assumption P \u2208 M holds, then\nobviously P \u226a Bayes, hence Bayes merges with P, i.e. d(P,Bayes|x) \u2192 0 w.p.1 for all P \u2208 M.\nUnlike many other Bayesian convergence and consistency theorems, no (independence, ergodicity,\nstationarity, identifiability, or other) assumption on the model class M needs to be made. The\nanalogous result for MDL is as follows:\nTheorem 1 (MDL predictions) Let M be a countable class of probability measures on X^\u221e\ncontaining the unknown true sampling distribution P. No (independence, ergodicity, stationarity,\nidentifiability, or other) assumptions need to be made on M. Let\n\nMDL^x := arg min_{Q\u2208M} {\u2212log Q(x) + K(Q)} with \u2211_{Q\u2208M} 2^{\u2212K(Q)} < \u221e\n\nbe the measure selected by MDL at time \u2113 given x \u2208 X^\u2113. Then the predictive distributions MDL^x[\u00b7|x]\nconverge to P[\u00b7|x] in the sense that\n\nd(P,MDL^x|x) \u2261 sup_{A\u2208F} |MDL^x[A|x] \u2212 P[A|x]| \u2192 0 for \u2113(x) \u2192 \u221e w.p.1\n\nK(Q) is usually interpreted and defined as the length of some prefix code for Q, in which\ncase \u2211_Q 2^{\u2212K(Q)} \u2264 1. If K(Q) := log_2 w_Q^{\u22121} is chosen as complexity, by Bayes rule Pr(Q|x) =\nQ(x)w_Q/Bayes(x), the maximum a posteriori estimate MAP^x := argmax_{Q\u2208M}{Pr(Q|x)} \u2261 MDL^x.\nHence the theorem also applies to MAP. The proof of the theorem is surprisingly subtle and complex\ncompared to the analogous Bayesian case. One reason is that MDL^x(x) is not a measure on X^\u221e.\n\nArbitrary X. For arbitrary measurable spaces X, definitions are more subtle, essentially because\npoint probabilities Q(x) have to be replaced by probability densities relative to some base measure\nM, usually Lebesgue for X = IR^d, counting measure for countable X, and e.g. M[\u00b7] = Bayes[\u00b7] for\ngeneral X. We have taken care that all results and proofs are valid unchanged for general X,\nwith Q(\u00b7) defined as a version of the Radon-Nikodym derivative relative to M. We spare the reader\nthe details, since they are completely standard and do not add any value to this paper, and space is\nlimited. The formal definitions of Q(x) and Q[A|x] can be found e.g. in [Doo53, BD62]. Note that\nMDL^x is independent of the particular choice of M.\n\n4 Proof for Finite Model Class\nWe first prove Theorem 1 for finite model classes M. For this we need the following Definition and\nLemma:\nDefinition 2 (Relations between Q and P) For any probability measures Q and P, let\n\n\u2022 Q^r+Q^s = Q be the Lebesgue decomposition of Q relative to P into an absolutely continuous\nnon-negative measure Q^r \u226a P and a singular non-negative measure Q^s\u22a5P.\n\u2022 g(\u03c9) := dQ^r/dP = lim_{\u2113\u2192\u221e}[Q(x_{1:\u2113})/P(x_{1:\u2113})] be (a version of) the Radon-Nikodym\nderivative, i.e. Q^r[A] = \u222b_A g dP.\n\u2022 \u2126\u25e6 := {\u03c9 : Q(x_{1:\u2113})/P(x_{1:\u2113}) \u2192 0} \u2261 {\u03c9 : g(\u03c9) = 0}.\n\u2022 \u2126\u20d7 := {\u03c9 : d(P,Q|x) \u2192 0 for \u2113(x) \u2192 \u221e}.\n\nIt is well-known that the Lebesgue decomposition exists and is unique. The representation of\nthe Radon-Nikodym derivative as a limit of local densities can e.g. 
be found in [Doo53, VII\u00a78]:\nZ^{r/s}_\u2113(\u03c9) := Q^{r/s}(x_{1:\u2113})/P(x_{1:\u2113}) for \u2113 = 1,2,3,... constitute two martingale sequences, which\nconverge w.p.1. Q^r \u226a P implies that the limit Z^r_\u221e is the Radon-Nikodym derivative dQ^r/dP. (Indeed,\nDoob\u2019s martingale convergence theorem can be used to prove the Radon-Nikodym theorem.) Q^s\u22a5P\nimplies Z^s_\u221e = 0 w.p.1. So g is uniquely defined and finite w.p.1.\nLemma 3 (Generalized merging of opinions) For any Q and P, the following holds:\n\n(i) P \u226a Q if and only if P[\u2126\u25e6] = 0\n(ii) P[\u2126\u25e6] = 0 implies P[\u2126\u20d7] = 1    [(i)+[BD62]]\n(iii) P[\u2126\u25e6\u222a\u2126\u20d7] = 1    [generalizes (ii)]\n\n(i) says that Q(x)/P(x) converges almost surely to a strictly positive value if and only if P is\nabsolutely continuous relative to Q, (ii) says that an almost sure positive limit of Q(x)/P(x) implies\nthat Q merges with P. (iii) says that even if P \u226a\u0338 Q, we still have d(P,Q|x) \u2192 0 on almost every\nsequence that has a positive limit of Q(x)/P(x).\nProof. Recall Definition 2.\n\n(i\u21d0) Assume P[\u2126\u25e6] = 0: P[A] > 0 implies Q[A] \u2265 Q^r[A] = \u222b_A g dP > 0, since g > 0 a.s. by assumption\nP[\u2126\u25e6] = 0. Therefore P \u226a Q.\n(i\u21d2) Assume P \u226a Q: Choose a B for which P[B] = 1 and Q^s[B] = 0. Now Q^r[\u2126\u25e6] = \u222b_{\u2126\u25e6} g dP = 0\nimplies 0 \u2264 Q[B\u2229\u2126\u25e6] \u2264 Q^s[B]+Q^r[\u2126\u25e6] = 0+0. By P \u226a Q this implies P[B\u2229\u2126\u25e6] = 0, hence\nP[\u2126\u25e6] = 0.\n(ii) That P \u226a Q implies P[\u2126\u20d7] = 1 is Blackwell-Dubins\u2019 celebrated result. The result now follows\nfrom (i).\n(iii) generalizes [BD62]. For P[\u2126\u25e6] = 0 it reduces to (ii). The case P[\u2126\u25e6] = 1 is trivial. Therefore\nwe can assume 0 < P[\u2126\u25e6] < 1. Consider measure P'[A] := P[A|B] conditioned on B := \u2126\\\u2126\u25e6.\nAssume Q[A] = 0. Using \u222b_{\u2126\u25e6} g dP = 0, we get 0 = Q^r[A] = \u222b_A g dP = \u222b_{A\\\u2126\u25e6} g dP. Since g > 0 outside\n\u2126\u25e6, this implies P[A\\\u2126\u25e6] = 0. So P'[A] = P[A\u2229B]/P[B] = P[A\\\u2126\u25e6]/P[B] = 0. Hence P' \u226a Q.\nNow (ii) implies d(P',Q|x) \u2192 0 with P' probability 1. Since P' \u226a P we also get d(P',P|x) \u2192 0\nw.P'.p.1.\nTogether this implies 0 \u2264 d(P,Q|x) \u2264 d(P',P|x)+d(P',Q|x) \u2192 0 w.P'.p.1, i.e. P'[\u2126\u20d7] = 1. The claim\nnow follows from\n\nP[\u2126\u25e6\u222a\u2126\u20d7] = P'[\u2126\u25e6\u222a\u2126\u20d7] P[\u2126\\\u2126\u25e6] + P[\u2126\u25e6\u222a\u2126\u20d7|\u2126\u25e6] P[\u2126\u25e6]\n= 1\u00b7P[\u2126\\\u2126\u25e6] + 1\u00b7P[\u2126\u25e6] = P[\u2126] = 1\n\nThe intuition behind the proof of Theorem 1 is as follows. MDL will asymptotically not select Q\nfor which Q(x)/P(x) \u2192 0. Hence for those Q potentially selected by MDL, we have \u03c9 \u2209 \u2126\u25e6, hence\n\u03c9 \u2208 \u2126\u20d7, for which d(P,Q|x) \u2192 0 (a.s.). The technical difficulties are for finite M that the eligible Q\ndepend on the sequence \u03c9, and for infinite M to deal with non-uniformly converging d, i.e. to infer\nd(P,MDL^x|x) \u2192 0.\nProof of Theorem 1 for finite M. Recall Definition 2, and let g_Q, \u2126\u25e6_Q, \u2126\u20d7_Q refer to some Q \u2208 M \u2261\n{Q_1,...,Q_m}. The set of sequences \u03c9 for which g_Q is undefined for some Q \u2208 M has\nP-measure zero, and hence can be ignored. 
Fix some sequence \u03c9 \u2208 \u2126 for which g_Q(\u03c9) is defined for\nall Q \u2208 M, and let M_\u03c9 := {Q \u2208 M : g_Q(\u03c9) = 0}.\n\nMDL^x := arg min_{Q\u2208M} L_Q(x), where L_Q(x) := \u2212log Q(x) + K(Q).\n\nConsider the difference\n\nL_Q(x) \u2212 L_P(x) = \u2212log[Q(x)/P(x)] + K(Q) \u2212 K(P) \u2192 \u2212log g_Q(\u03c9) + K(Q) \u2212 K(P) for \u2113 \u2192 \u221e\n\nFor Q \u2208 M_\u03c9, the r.h.s. is +\u221e, hence\n\n\u2200Q\u2208M_\u03c9 \u2203\u2113_Q \u2200\u2113 > \u2113_Q : L_Q(x) > L_P(x)\n\nSince M is finite, this implies\n\n\u2200\u2113 > \u2113_0 \u2200Q\u2208M_\u03c9 : L_Q(x) > L_P(x), where \u2113_0 := max{\u2113_Q : Q \u2208 M_\u03c9} < \u221e\n\nTherefore, since P \u2208 M, we have MDL^x \u2209 M_\u03c9 \u2200\u2113 > \u2113_0, so we can safely ignore all Q \u2208 M_\u03c9 and\nfocus on Q \u2208 M\u0304_\u03c9 := M\\M_\u03c9. Let \u2126_1 := \u22c2_{Q\u2208M\u0304_\u03c9} (\u2126\u25e6_Q\u222a\u2126\u20d7_Q). Since P[\u2126_1] = 1 by Lemma 3(iii), we\ncan also assume \u03c9 \u2208 \u2126_1.\n\nQ \u2208 M\u0304_\u03c9 \u21d2 g_Q(\u03c9) > 0 \u21d2 \u03c9 \u2209 \u2126\u25e6_Q \u21d2 \u03c9 \u2208 \u2126\u20d7_Q \u21d2 d(P,Q|x) \u2192 0\n\nThis implies\n\nd(P,MDL^x|x) \u2264 sup_{Q\u2208M\u0304_\u03c9} d(P,Q|x) \u2192 0\n\nwhere the inequality holds for \u2113 > \u2113_0 and the limit holds, since M is finite. Since the set of \u03c9\nexcluded in our considerations has measure zero, d(P,MDL^x|x) \u2192 0 w.p.1, which proves the\ntheorem for finite M.\n\n5 Proof for Countable Model Class\nThe proof in the previous Section crucially exploited finiteness of M. We want to prove that the\nprobability that MDL asymptotically selects \u201ccomplex\u201d Q is small. 
The following Lemma establishes\nthat the probability that MDL selects a specific complex Q infinitely often is small.\n\nLemma 4 (MDL avoids complex probability measures Q) For any Q and P we have\nP[Q(x)/P(x) \u2265 c infinitely often] \u2264 1/c.\n\nProof.\n\nP[\u2200\u2113_0 \u2203\u2113 > \u2113_0 : Q(x)/P(x) \u2265 c] (a)= P[limsup_{\u2113\u2192\u221e} Q(x)/P(x) \u2265 c] (b)\u2264 (1/c) E[limsup_\u2113 Q(x)/P(x)]\n(c)= (1/c) E[lim_\u2113 Q(x)/P(x)] (d)\u2264 (1/c) lim_\u2113 E[Q(x)/P(x)] (e)\u2264 1/c\n\n(a) is true by definition of the limit superior limsup, (b) is Markov\u2019s inequality, (c) exploits the fact that\nthe limit of Q(x)/P(x) exists w.p.1, (d) uses Fatou\u2019s lemma, and (e) is obvious.\n\nFor sufficiently complex Q, Lemma 4 implies that L_Q(x) > L_P(x) for most x. Since convergence is\nnon-uniform in Q, we cannot apply the Lemma to all (infinitely many) complex Q directly, but need\nto lump them into one Q\u0304.\n\nProof of Theorem 1 for countable M. Let the Q \u2208 M = {Q_1,Q_2,...} be ordered somehow,\ne.g. in increasing order of complexity K(Q), and P = Q_n. Choose some (large) m \u2265 n and let\nM\u0303 := {Q_{m+1},Q_{m+2},...} be the set of \u201ccomplex\u201d Q. 
We show that the probability that MDL selects\ninfinitely often complex Q is small:\n\nP[MDL^x \u2208 M\u0303 infinitely often] \u2261 P[\u2200\u2113_0 \u2203\u2113 > \u2113_0 : MDL^x \u2208 M\u0303]\n\u2264 P[\u2200\u2113_0 \u2203\u2113 > \u2113_0 \u2227 Q \u2208 M\u0303 : L_Q(x) \u2264 L_P(x)] = P[\u2200\u2113_0 \u2203\u2113 > \u2113_0 : sup_{i>m} (Q_i(x)/P(x)) 2^{K(P)\u2212K(Q_i)} \u2265 1]\n(a)\u2264 P[\u2200\u2113_0 \u2203\u2113 > \u2113_0 : (Q\u0304(x)/P(x)) \u03b4 2^{K(P)} \u2265 1] (b)\u2264 \u03b4 2^{K(P)} (c)\u2264 \u03b5\n\nThe first three relations follow immediately from the definition of the various quantities. Bound (a)\nis the crucial \u201clumping\u201d step. First we bound\n\nsup_{i>m} (Q_i(x)/P(x)) 2^{\u2212K(Q_i)} \u2264 \u2211_{i=m+1}^\u221e (Q_i(x)/P(x)) 2^{\u2212K(Q_i)} = \u03b4 Q\u0304(x)/P(x),\n\nQ\u0304(x) := (1/\u03b4) \u2211_{i>m} Q_i(x) 2^{\u2212K(Q_i)},    \u03b4 := \u2211_{i>m} 2^{\u2212K(Q_i)} < \u221e\n\nWhile MDL^\u00b7[\u00b7] is not a (single) measure on \u2126 and hence difficult to deal with, Q\u0304 is a proper\nprobability measure on \u2126. In a sense, this step reduces MDL to Bayes. Now we apply Lemma 4 in (b)\nto the (single) measure Q\u0304. The bound (c) holds for sufficiently large m = m_\u03b5(P), since \u03b4 \u2192 0 for\nm \u2192 \u221e. This shows that for the sequence of MDL estimates\n\n{MDL^{x_{1:\u2113}} : \u2113 > \u2113_0} \u2286 {Q_1,...,Q_m} with probability at least 1\u2212\u03b5\n\nHence the already proven Theorem 1 for finite M implies that d(P,MDL^x|x) \u2192 0 with probability\nat least 1\u2212\u03b5. Since convergence holds for every \u03b5 > 0, it holds w.p.1.\n\n6 Implications\n\nDue to its generality, Theorem 1 can be applied to many problem classes. 
We illustrate some immediate\nimplications of Theorem 1 for time-series forecasting, classification, regression, discriminative\nlearning, and reinforcement learning.\nTime-series forecasting. Classical online sequence prediction is concerned with predicting x_{\u2113+1}\nfrom (non-i.i.d.) sequence x_{1:\u2113} for \u2113 = 1,2,3,.... Forecasting farther into the future is possible by\npredicting x_{\u2113+1:\u2113+h} for some h > 0. Hence Theorem 1 implies good asymptotic (multi-step)\npredictions. Offline learning is concerned with training a predictor on x_{1:\u2113} for fixed \u2113 in-house, and then\nselling and using the predictor on x_{\u2113+1:\u221e} without further learning. Theorem 1 shows that for enough\ntraining data, predictions \u201cpost-learning\u201d will be good.\nClassification and Regression. In classification (discrete X) and regression (continuous X), a sample\nis a set of pairs D = {(y_1,x_1),...,(y_\u2113,x_\u2113)}, and a functional relationship x\u0307 = f(y\u0307)+noise, i.e. a\nconditional probability P(x\u0307|y\u0307) shall be learned. For reasons apparent below, we have swapped the\nusual role of x\u0307 and y\u0307. The dots indicate x\u0307 \u2208 X and y\u0307 \u2208 Y, while x = x_{1:\u2113} \u2208 X^\u2113 and y = y_{1:\u2113} \u2208 Y^\u2113.\nIf we assume that also y\u0307 follows some distribution, and start with a countable model class M of\njoint distributions Q(x\u0307,y\u0307) which contains the true joint distribution P(x\u0307,y\u0307), our main result implies\nthat MDL^D[(x\u0307,y\u0307)|D] converges to the true distribution P(x\u0307,y\u0307). Indeed since/if samples are assumed\ni.i.d., we don\u2019t need to invoke our general result.\nDiscriminative learning. 
Instead of learning a generative [Jeb03] joint distribution P(x\u0307,y\u0307), which\nrequires model assumptions on the input y\u0307, we can discriminatively [LSS07] learn P(\u00b7|y\u0307) directly\nwithout any assumption on y (not even i.i.d.). We can simply treat y_{1:\u221e} as an oracle to all Q, define\nM' = {Q'} with Q'(x) := Q(x|y_{1:\u221e}), and apply our main result to M', leading to MDL'^x[A|x] \u2192\nP'[A|x], i.e. MDL^{x|y_{1:\u221e}}[A|x,y_{1:\u221e}] \u2192 P[A|x,y_{1:\u221e}]. If y_1,y_2,... are conditionally independent, or\nmore generally for any causal process, we have Q(x|y) = Q(x|y_{1:\u221e}). Since the x given y are not\nidentically distributed, classical MDL consistency results for i.i.d. or stationary-ergodic sources do\nnot apply. The following corollary formalizes our findings:\nCorollary 5 (Discriminative MDL) Let M \u220b P be a class of discriminative causal distributions\nQ[\u00b7|y_{1:\u221e}], i.e. Q(x|y_{1:\u221e}) = Q(x|y), where x = x_{1:\u2113} and y = y_{1:\u2113}. Regression and classification are\ntypical examples. Further assume M is countable. Let MDL^{x|y} := argmin_{Q\u2208M}{\u2212log Q(x|y)+\nK(Q)} be the discriminative MDL measure (at time \u2113 given x,y). Then sup_A |MDL^{x|y}[A|x,y] \u2212\nP[A|x,y]| \u2192 0 for \u2113(x) \u2192 \u221e, P[\u00b7|y_{1:\u221e}] almost surely, for every sequence y_{1:\u221e}.\n\nFor finite Y and conditionally independent x, the intuitive reason why this can work is as follows:\nIf y\u0307 appears in y_{1:\u221e} only finitely often, it plays asymptotically no role; if it appears infinitely often,\nthen P(\u00b7|y\u0307) can be learned. For infinite Y and deterministic M, the result is also intelligible: Every\ny\u0307 might appear only once, but probing enough function values x_t = f(y_t) allows identifying the\nfunction.\nReinforcement learning (RL). 
In the agent framework [RN03], an agent interacts with an envi-\nronment in cycles. At time t, an agent chooses an action yt based on past experience x