{"title": "Putting Bayes to sleep", "book": "Advances in Neural Information Processing Systems", "page_first": 135, "page_last": 143, "abstract": "We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such ''sparse composite models'' is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors.  We build atop the ''specialist'' framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound.", "full_text": "Putting Bayes to sleep\n\nWouter M. Koolen\u2217\n\nDmitry Adamskiy\u2020\n\nManfred K. Warmuth\u2021\n\nAbstract\n\nWe consider sequential prediction algorithms that are given the predictions from\na set of models as inputs. 
If the nature of the data is changing over time in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (kind of like a weak restart). However, what if the favored models in each segment are from a small subset, i.e. the data is likely to be predicted well by models that predicted well before? Curiously, fitting such "sparse composite models" is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. Also it is important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations can be found for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors.
We build atop the "specialist" framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new intriguing efficient update that achieves a significantly better bound.

1 Introduction

We consider sequential prediction of outcomes y1, y2, . . . using a set of models m = 1, . . . , M for this task. In practice m could range over a mix of human experts, parametric models, or even complex machine learning algorithms. In any case we denote the prediction of model m for outcome yt given past observations y<t = (y1, . . . , yt−1) by P(yt|y<t, m). The goal is to design a computationally efficient predictor P(yt|y<t) that maximally leverages the predictive power of these models as measured in log loss. The yardstick in this paper is a notion of regret defined w.r.t. 
a given comparator class of models or composite models: it is the additional loss of the predictor over the best comparator. For example, if the comparator class is the set of base models m = 1, . . . , M, then the regret for a sequence of T outcomes y≤T = (y1, . . . , yT) is

R := Σ_{t=1}^T −ln P(yt|y<t) − min_{m=1,...,M} Σ_{t=1}^T −ln P(yt|y<t, m).

The Bayesian predictor (detailed below) with uniform model prior has regret at most ln M for all T.
Typically the nature of the data is changing with time: in an initial segment one model predicts well, followed by a second segment in which another model has small loss, and so forth. For this scenario the natural comparator class is the set of partition models, which divide the sequence of T outcomes into B segments and specify the model that predicts in each segment. By running Bayes on all exponentially many partition models comprising the comparator class, we can guarantee regret ln (T−1 choose B−1) + B ln M, which is optimal. The goal then is to find efficient algorithms with approximately the same guarantee as full Bayes. In this case this is achieved by the Fixed Share [HW98] predictor. It assigns a certain prior to all partition models for which the exponentially many posterior weights collapse to M posterior weights that can be maintained efficiently. Modifications of this algorithm achieve essentially the same bound for all T, B and M simultaneously [VW98, KdR08].

∗Supported by NWO Rubicon grant 680-50-1010.
†Supported by the Veterinary Laboratories Agency of the Department for Environment, Food and Rural Affairs.
‡Supported by NSF grant IIS-0917397.

In an open problem, Yoav Freund [BW02] asked whether there are algorithms that have small regret against sparse partition models, where the base models allocated to the segments are from a small subset of N of the M models. 
The Bayes algorithm, when run on all such partition models, achieves regret ln (M choose N) + ln (T−1 choose B−1) + B ln N, but contrary to the non-sparse case, emulating this algorithm is NP-hard. However, in a breakthrough paper, Bousquet and Warmuth in 2001 [BW02] gave the efficient MPP algorithm with only a slightly weaker regret bound. Like Fixed Share, MPP maintains M "posterior" weights, but it instead mixes in a bit of all past posteriors in each update. This causes the weights of previously good models to "glow" a little bit, even if they perform badly locally. When the data later favors one of those good models, its weight is pulled up quickly. However, the term "posterior" is a misnomer, because no Bayesian interpretation for this curious self-referential update was known. Understanding the MPP update is a very important problem because in many practical applications [HLSS00, GWBA02]¹ it significantly outperforms Fixed Share.
Our main philosophical contribution is finding a Bayesian interpretation for MPP. We employ the specialist framework from online learning [FSSW97, CV09, CKZV10]. So-called specialist models are either awake or asleep. When they are awake, they predict as usual. However, when they are asleep, they "go with the rest", i.e. they predict with the combined prediction of all awake models. Instead of fully coordinated partition models, we construct partition specialists consisting of a base model and a set of segments where this base model is awake. The figure to the right shows how a comparator partition model is assembled from partition specialists. We can emulate Bayes on all partition specialists; NP-completeness is avoided by forgoing a-priori segment synchronization. 
By carefully choosing the prior, the exponentially many posterior weights collapse to the small number of weights used by the efficient MPP algorithm. Our analysis technique magically aggregates the contribution of the N partition specialists that constitute the comparator partition, showing that we achieve regret close to the regret of Bayes when run on all full partition models. Actually our new insights into the nature of MPP result in slightly improved regret bounds.
We then apply our methods to an online multitask learning problem where a small subset of models from a big set solve a large number of tasks. Again simulating Bayes on all sparse assignments of models to tasks is NP-hard. We split an assignment into subset specialists that assign a single base model to a subset of tasks. With the right prior, Bayes on these subset specialists again gently collapses to an efficient algorithm with a regret bound not much larger than Bayes on all assignments. This considerably improves the previous regret bound of [ABR07]. Our algorithm simply maintains one weight per model/task pair and does not rely on sampling (often used for multitask learning).
Why is this line of research important? We found a new intuitive Bayesian method to quickly recover information that was learned before, allowing us to exploit sparse composite models. Moreover, it expressly avoids computational hardness by splitting composite models into smaller constituent "specialists" that are asleep in time steps outside their jurisdiction. This method clearly beats Fixed Share when few base models constitute a partition, i.e. the composite models are sparse.
We expect this methodology to become a main tool for making Bayesian prediction adapt to sparse models. The goal is to develop general tools for adding this type of adaptivity to existing Bayesian models without losing efficiency. 
It also lets us look again at the updates used in Nature in a new light, where species/genes cannot dare adapt too quickly to the current environment and must guard themselves against an environment that changes or fluctuates at a large scale. Surprisingly, these types of updates might now be amenable to a Bayesian analysis. For example, it might be possible to interpret sex and the double-stranded recessive/dominant gene device employed by Nature as a Bayesian update of genes that are either awake or asleep.

¹The experiments reported in [HLSS00] are based on precursors of MPP. However MPP outperforms these algorithms in later experiments we have done on natural data for the same problem (not shown).

[Figure: (a) A comparator partition model: segmentation and model assignment. (b) Decomposition into 3 partition specialists, asleep at shaded times.]

2 Bayes and Specialists

We consider sequential prediction of outcomes y1, y2, . . . from a finite alphabet. Assume that we have access to a collection of models m = 1, . . . , M with data likelihoods P(y1, y2, . . . |m). We then design a prior P(m) with roughly two goals in mind: the Bayes algorithm should "collapse" (become efficient) and have a good regret bound. After observing past outcomes y<t := (y1, . . . , yt−1), the next outcome yt is predicted by the predictive distribution P(yt|y<t), which averages the model predictions P(yt|y<t, m) according to the posterior distribution P(m|y<t):

P(yt|y<t) = Σ_{m=1}^M P(yt|y<t, m) P(m|y<t), where P(m|y<t) = P(y<t|m) P(m) / P(y<t).

The latter is conveniently updated step-wise: P(m|yt, y<t) = P(yt|y<t, m) P(m|y<t) / P(yt|y<t).
The log loss of the Bayesian predictor on data y≤T := (y1, . . .
, yT) is the cumulative loss of the predictive distributions, and this readily relates to the cumulative loss of any model m̂:

−ln P(y≤T) − (−ln P(y≤T|m̂)) = −ln ( Σ_{m=1}^M P(y≤T|m) P(m) ) − (−ln P(y≤T|m̂)) ≤ −ln P(m̂),

where −ln P(y≤T) = Σ_{t=1}^T −ln P(yt|y<t) and −ln P(y≤T|m̂) = Σ_{t=1}^T −ln P(yt|y<t, m̂).
That is, the additional loss (or regret) of Bayes w.r.t. model m̂ is at most −ln P(m̂). The uniform prior P(m) = 1/M ensures regret at most ln M w.r.t. any model m̂. This is a so-called individual sequence result, because no probabilistic assumptions were made on the data.
Our main results will make essential use of the following fancier weighted notion of regret. Here U(m) is any distribution on the models and Δ(U(m)‖P(m)) := Σ_{m=1}^M U(m) ln (U(m)/P(m)) denotes the relative entropy between the distributions U(m) and P(m):

Σ_{m=1}^M U(m) ( −ln P(y≤T) − (−ln P(y≤T|m)) ) = Δ(U(m)‖P(m)) − Δ(U(m)‖P(m|y≤T)).   (1)

By dropping the subtracted positive term we get an upper bound. The previous regret bound is now the special case when U is concentrated on model m̂. However, when multiple models are good we achieve tighter regret bounds by letting U be the uniform distribution on all of them.

Specialists We now consider a complication of the prediction task, which was introduced in the online learning literature under the name specialists [FSSW97]. The Bayesian algorithm, adapted to this task, will serve as the foundation of our main results. The idea is that in practice the predictions P(yt|y<t, m) of some models may be unavailable. 
Human forecasters may be specialized, unreachable or too expensive, algorithms may run out of memory or simply take too long. We call models that may possibly abstain from prediction specialists. The question is how to produce quality predictions from the predictions that are available.
We will denote by Wt the set of specialists whose predictions are available at time t, and call them awake and the others asleep. The crucial idea, introduced in [CV09], is to assign to the sleeping specialists the prediction P(yt|y<t). But wait! That prediction P(yt|y<t) is defined to average all model predictions, including those of the sleeping specialists, which we just defined to be P(yt|y<t):

P(yt|y<t) = Σ_{m∈Wt} P(yt|y<t, m) P(m|y<t) + Σ_{m∉Wt} P(yt|y<t) P(m|y<t).

Although this equation is self-referential, it does have a unique solution, namely

P(yt|y<t) := Σ_{m∈Wt} P(yt|y<t, m) P(m|y<t) / P(Wt|y<t).

Thus the sleeping specialists are assigned the average prediction of the awake ones. This completes them to full models to which we can apply the unaltered Bayesian method as before. At first this may seem like a kludge, but actually this phenomenon arises naturally wherever concentrations are manipulated. For example, in a democracy abstaining essentially endorses the vote of the participating voters, or in Nature unexpressed genes reproduce at rates determined by the active genes of the organism. 
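To make the mechanics concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of one round of Bayes with specialists: the awake models are averaged with renormalisation over the awake set, and only the awake weights are touched, as in the update rule (2).

```python
# Sketch (not the authors' code): one round of Bayes with specialists.
# Sleeping specialists implicitly predict with the awake aggregate,
# so their posterior weights come out unchanged.

def specialist_bayes_step(posterior, awake, preds):
    """posterior: dict model -> P(m | y_<t), summing to 1
    awake:     set of models in the wake set W_t
    preds:     dict model -> P(y_t | y_<t, m) for awake models,
               evaluated at the observed outcome y_t
    Returns (aggregate probability of y_t, updated posterior)."""
    wake_mass = sum(posterior[m] for m in awake)          # P(W_t | y_<t)
    # P(y_t | y_<t): average of awake predictions, renormalised over W_t
    agg = sum(preds[m] * posterior[m] for m in awake) / wake_mass
    new = dict(posterior)
    for m in awake:  # multiplicative Bayes update; asleep weights untouched
        new[m] = posterior[m] * preds[m] / agg
    return agg, new
```

Note that after the update the awake weights still sum to their original total mass, which is exactly the renormalisation described next.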
The effect of abstaining on the update of the posterior weights is also intuitive: the weights of asleep specialists are unaffected, whereas the weights of awake models are updated with Bayes' rule and then renormalised to the original weight of the awake set:

P(m|y≤t) = P(yt|y<t, m) P(m|y<t) / ( Σ_{m∈Wt} P(yt|y<t, m) P(m|y<t) ) · P(Wt|y<t)   if m ∈ Wt,
P(m|y≤t) = P(m|y<t)   if m ∉ Wt.   (2)

To obtain regret bounds in the specialist setting, we use the fact that sleeping specialists m ∉ Wt are defined to predict P(yt|y<t, m) := P(yt|y<t) like the Bayesian aggregate. Now (1) becomes:

Theorem 1 ([FSSW97, Theorem 1]). Let U(m) be any distribution on a set of specialists with wake sets W1, W2, . . . Then for any T, Bayes guarantees

Σ_{m=1}^M U(m) ( Σ_{t≤T : m∈Wt} −ln P(yt|y<t) − Σ_{t≤T : m∈Wt} −ln P(yt|y<t, m) ) ≤ Δ(U(m)‖P(m)).

3 Sparse partition learning

We design efficient predictors with small regret compared to the best sparse partition model. We do this by constructing partition specialists from the input models and obtain a proper Bayesian predictor by averaging their predictions. We consider two priors. With the first prior we obtain the Mixing Past Posteriors (MPP) algorithm, giving it a Bayesian interpretation and slightly improving its regret bound. We then develop a new Markov chain prior. Bayes with this prior collapses to an efficient algorithm for which we prove the best known regret bound compared to sparse partitions.

Construction Each partition specialist (χ, m) is parameterized by a model index m and a circadian (wake/sleep pattern) χ = (χ1, χ2, . . .) 
with χt ∈ {w, s}. We use infinite circadians in order to obtain algorithms that do not depend on a time horizon. The wake set Wt includes all partition specialists that are awake at time t, i.e. Wt := {(χ, m) | χt = w}. An awake specialist (χ, m) in Wt predicts as the base model m, i.e. P(yt|y<t, (χ, m)) := P(yt|y<t, m). The Bayesian joint distribution P is completed² by choosing a prior on partition specialists. In this paper we enforce the independence P(χ, m) := P(χ) P(m) and define P(m) := 1/M uniform on the base models.
We now can apply Theorem 1 to bound the regret w.r.t. any partition model with time horizon T by decomposing it into N partition specialists (χ^1_{≤T}, m̂1), . . . , (χ^N_{≤T}, m̂N) and choosing U(·) = 1/N uniform on these specialists:

R ≤ N ln (M/N) + Σ_{n=1}^N −ln P(χ^n_{≤T}).   (3)

The overhead of selecting N reference models from the pool of size M closely approximates the information-theoretic ideal, since N ln (M/N) + N ≈ ln (M choose N). This improves previous regret bounds [BW02, ABR07, CBGLS12] by an additive N ln N. Next we consider two choices for P(χ): one for which we retrieve MPP, and a natural one which leads to efficient algorithms and sharper bounds.

3.1 A circadian prior equivalent to Mixing Past Posteriors

The Mixing Past Posteriors algorithm is parameterized by a so-called mixing scheme, which is a sequence γ1, γ2, . . . of distributions, each γt with support {0, . . . , t−1}. MPP predicts outcome yt with Predt(yt) := Σ_{m=1}^M P(yt|y<t, m) vt(m), i.e. 
by averaging the model predictions with weights vt(m) defined recursively by

vt(m) := Σ_{s=0}^{t−1} ṽ_{s+1}(m) γt(s), where ṽ1(m) := 1/M and ṽ_{t+1}(m) := P(yt|y<t, m) vt(m) / Predt(yt).

²From here on we use the symbol P for the Bayesian joint to avoid a fundamental ambiguity: P(yt|y<t, m) does not equal the prediction P(yt|y<t, m) of the input model m, since it averages over both asleep and awake specialists (χ, m). The predictions of base models are now recovered as P(yt|y<t, Wt, m) = P(yt|y<t, m).

The auxiliary distribution ṽ_{t+1}(m) is formally the (incremental) posterior from prior vt(m). The predictive weights vt(m) are then the pre-specified γt mixture of all such past posteriors.
To make the Bayesian predictor equal to MPP, we define from the MPP mixing scheme a circadian prior measure P(χ) that puts mass only on sequences with a finite nonzero number of w's, by

P(χ) := (1/(sJ (sJ + 1))) Π_{j=1}^J γ_{s_j}(s_{j−1}), where s_{≤J} are the indices of the w's in χ and s0 = 0.   (4)

We built the independence m ⊥ χ into the prior P(χ, m), and (4) ensures χ<t ⊥ χ>t | χt = w for all t. Since the outcomes y≤t are a stochastic function of m and χ≤t, the Bayesian joint satisfies

y≤t, m ⊥ χ>t | χt = w   for all t.   (5)

Theorem 2. Let Predt(yt) be the prediction of MPP for some mixing scheme γ1, γ2, . . . Let P(yt|y<t) be the prediction of Bayes with prior (4). Then for all outcomes y≤t

Predt(yt) = P(yt|y<t).

Proof. Partition the event Wt = {χt = w} into Z^t_q := {χt = χq = w and χr = s for all q < r < t} for all 0 ≤ q < t, with the convention that χ0 = w. 
We first establish that the Bayesian joint with prior (4) satisfies y≤t ⊥ Wt for all t. Namely, by induction on t, for all q < t

P(y<t|Z^t_q) = P(y<t|y≤q) P(y≤q|Z^t_q) =(5) P(y<t|y≤q) P(y≤q|Wq) =(Induction) P(y<t),

and therefore P(y≤t|Wt) = P(yt|y<t) Σ_{q=0}^{t−1} P(y<t|Z^t_q) P(Z^t_q|Wt) = P(y≤t), i.e. y≤t ⊥ Wt. The theorem will be implied by the stronger claim vt(m) = P(m|y<t, Wt), which we again prove by induction on t. The case t = 1 is trivial. For t > 1, we expand the right-hand side, apply (5), use the independence we just proved, and the fact that asleep specialists predict with the rest:

P(m|y<t, Wt) = Σ_{q=0}^{t−1} P(m|y≤q, Wq) · ( P(y<t|Z^t_q, m, y≤q) / P(y<t|y≤q) ) · ( P(Z^t_q|m, y≤q, Wq) P(Wq|y≤q) / P(Wt|y<t) )
= Σ_{q=0}^{t−1} ( P(yq|y<q, m) P(m|y<q, Wq) / P(yq|y<q) ) · P(Z^t_q|Wt),

where the conditionings on m, on the past outcomes and on the likelihood ratio cancel by (5) and the independence just established. By (4), P(Z^t_q|Wt) = γt(q), and the proof is completed by applying the induction hypothesis.

The proof of the theorem provides a Bayesian interpretation of all the MPP weights: vt(m) = P(m|y<t, Wt) is the predictive distribution, ṽ_{t+1}(m) = P(m|y≤t, Wt) is the posterior, and γt(q) = P(Z^t_q|Wt) is the conditional probability of the previous awake time.

3.2 A simple Markov chain circadian prior

In the previous section we recovered circadian priors corresponding to the MPP mixing schemes. Here we design priors afresh from first principles. Our goal is efficiency and good regret bounds. 
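For concreteness, the MPP update analyzed above can be sketched in a few lines of Python (a hedged illustration under our own naming, not the authors' code: `past` holds the incremental posteriors ṽ, and `gamma_t` is the mixing distribution over {0, . . . , t−1}):

```python
# Sketch (not the authors' code) of one Mixing Past Posteriors round.
# past[s] stores the incremental posterior \tilde v_{s+1} as a dict.

def mpp_round(past, gamma_t, preds):
    """past: list [\tilde v_1, ..., \tilde v_t] of weight dicts
    gamma_t: list of t mixing weights summing to 1
    preds: dict model -> P(y_t | y_<t, m) at the observed outcome
    Appends \tilde v_{t+1} to past and returns Pred_t(y_t)."""
    models = past[0].keys()
    # v_t(m): the pre-specified gamma_t mixture of all past posteriors
    v = {m: sum(g * p[m] for g, p in zip(gamma_t, past)) for m in models}
    pred = sum(preds[m] * v[m] for m in models)
    past.append({m: preds[m] * v[m] / pred for m in models})  # \tilde v_{t+1}
    return pred
```

Storing every past posterior makes the long-term memory explicit: a model that did well at any earlier time keeps mass in some entry of `past` and can be revived by the mixture.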
A simple and intuitive choice for the prior P(χ) is a Markov chain on states {w, s} with initial distribution θ(·) and transition probabilities θ(·|w) and θ(·|s), that is

P(χ≤t) := θ(χ1) Π_{s=2}^t θ(χs|χ_{s−1}).   (6)

By choosing low transition probabilities we obtain a prior that favors temporal locality, in that it allocates high probability to circadians that are awake and asleep in contiguous segments. Thus if a good sparse partition model exists for the data, our algorithm will pick up on this and predict well. The resulting Bayesian strategy (aggregating infinitely many specialists) can be executed efficiently.

Theorem 3. The prediction P(yt|y<t) of Bayes with Markov prior (6) equals the prediction Predt(yt) of Algorithm 1, which can be computed in O(M) time per outcome using O(M) space.

Proof. We prove by induction on t that vt(b, m) = P(χt = b, m|y<t) for each model m and b ∈ {w, s}. The base case t = 1 is automatic. For the induction step we expand

P(χ_{t+1} = b, m|y≤t) =(6) θ(b|w) P(χt = w, m|y≤t) + θ(b|s) P(χt = s, m|y≤t)
=(2) θ(b|w) P(χt = w, m|y<t) P(yt|y<t, m) / ( Σ_{i=1}^M P(i|χt = w, y<t) P(yt|y<t, i) ) + θ(b|s) P(χt = s, m|y<t).

By applying the induction hypothesis we obtain the update rule for v_{t+1}(b, m).

Algorithm 1 Bayes with Markov circadian prior (6) (for Freund's problem)
  Input: Distributions θ(·), θ(·|w) and θ(·|s) on {w, s}.
  Initialize v1(b, m) := θ(b)/M for each model m and b ∈ {w, s}.
  for t = 1, 2, . . . 
do
    Receive the prediction P(yt|y<t, m) of each model m.
    Predict with Predt(yt) := Σ_{m=1}^M P(yt|y<t, m) vt(m|w), where vt(m|w) = vt(w, m) / Σ_{m′=1}^M vt(w, m′).
    Observe outcome yt and suffer loss −ln Predt(yt).
    Update v_{t+1}(b, m) := θ(b|w) (P(yt|y<t, m)/Predt(yt)) vt(w, m) + θ(b|s) vt(s, m).
  end for

The previous theorem establishes that we can predict fast. Next we show that we predict well.

Theorem 4. Let m̂1, . . . , m̂T be an N-sparse assignment of M models to T times with B segments. The regret of Bayes (Algorithm 1) with tuning θ(w) = 1/N, θ(s|w) = (B−1)/(T−1) and θ(w|s) = (B−1)/((N−1)(T−1)) is at most

R ≤ N ln (M/N) + (T−1) H((B−1)/(T−1)) + (N−1)(T−1) H((B−1)/((N−1)(T−1))) + N H(1/N),

where H(p) := −p ln(p) − (1−p) ln(1−p) is the binary entropy function.

Proof. Without loss of generality assume m̂t ∈ {1, . . . , N}. For each reference model n pick the circadian χ^n_{≤T} with χ^n_t = w iff m̂t = n. Expanding the definition of the prior (6) we find

Π_{n=1}^N P(χ^n_{≤T}) = θ(w) θ(s)^{N−1} θ(s|s)^{(N−1)(T−1)−(B−1)} θ(w|w)^{T−B} θ(w|s)^{B−1} θ(s|w)^{B−1},

which is in fact maximized by the proposed tuning. The theorem follows from (3).

The information-theoretic ideal regret is ln (M choose N) + ln (T−1 choose B−1) + B ln N. 
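A minimal rendering of Algorithm 1's loop body may help (our own sketch, not released code; weights are kept in a dict keyed by (state, model) pairs):

```python
# Sketch (not the authors' code): one round of Bayes with the Markov
# circadian prior, O(M) time and space as in Algorithm 1.

def markov_circadian_round(v, theta, preds):
    """v: dict (b, m) -> weight, for b in {'w', 's'} and models m
    theta: dict with transition probabilities theta[(b, a)] = theta(b | a)
    preds: dict m -> P(y_t | y_<t, m) at the observed outcome
    Returns Pred_t(y_t) and the updated weights v_{t+1}."""
    models = {m for (_, m) in v}
    wake_mass = sum(v[('w', m)] for m in models)
    # Pred_t(y_t): awake weights, renormalised over the awake mass
    pred = sum(preds[m] * v[('w', m)] for m in models) / wake_mass
    new = {}
    for m in models:
        glow = preds[m] / pred * v[('w', m)]  # Bayes-updated awake weight
        for b in ('w', 's'):
            new[(b, m)] = theta[(b, 'w')] * glow + theta[(b, 's')] * v[('s', m)]
    return pred, new
```

The sleeping weight v(s, m) is where a previously good model "glows": it decays only through the slow s-to-w and w-to-s transitions, so the model can be recovered quickly when it predicts well again.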
Theorem 4 is very close to this ideal, except for a factor of 2 in front of the middle term; since n H(k/n) ≤ k ln(n/k) + k we have

R ≤ N ln (M/N) + 2 (B−1) ln ((T−1)/(B−1)) + B ln N + 2B.

The origin of this factor remained a mystery in [BW02], but becomes clear in our analysis: it is the price of coordination between the specialists that constitute the best partition model. To see this, let us regard a circadian as a sequence of wake/sleep transition times. With this viewpoint, (3) bounds the regret by summing the prior costs of all the reference wake/sleep transition times. This means that we incur overhead at each segment boundary of the comparator twice: once as the sleep time of the preceding model, and once more as the wake time of the subsequent model.
In practice the comparator parameters T, N and B are unknown. This can be addressed by standard orthogonal techniques. Of particular interest is the method, inspired by [SM99, KdR08, Koo11], of changing the Markov transition probabilities as a function of time. It can be shown that by setting θ(w) = 1/2 and increasing θ(w|w) and θ(s|s) as exp(−1/(t ln²(t+1))) we keep the update time and space of the algorithm at O(M) and guarantee regret bounded for all T, N and B as

R ≤ N ln (M/N) + 2N + 2(B−1) ln T + 4(B−1) ln ln(T+1).

At no computational overhead, this bound is remarkably close to the fully tuned bound of the theorem above, especially when the number of segments B is modest as a function of T.

4 Sparse multitask learning

We transition to an extension of the sequential prediction setup called online multitask learning [ABR07, RAB07, ARB08, LPS09, CCBG10, SRDV11]. The new ingredient is that before predicting outcome yt we are given its task number κt ∈ {1, . . . , K}. The goal is to exploit similarities between tasks. 
As before, we have access to M models that each issue a prediction each round. If a single model predicts well on several tasks we want to figure this out quickly and exploit it. Simply ignoring the task number would not result in an adaptive algorithm. Applying a separate Bayesian predictor to each task independently would not result in any inter-task synergy. Nevertheless, it would guarantee regret at most K ln M overall. Now suppose each task is predicted well by some model from a small subset of models of size N ≪ M. Running Bayes on all N-sparse allocations would achieve regret ln (M choose N) + K ln N. However, emulating Bayes in this case is NP-hard [RAB07]. The goal is to design efficient algorithms with approximately the same regret bound.
In [ABR07] this multitask problem is reduced to MPP, giving regret bound N ln (M/N) + B ln N. Here B is the number of same-task segments in the task sequence κ≤T. When all outcomes with the same task number are consecutive, i.e. B = K, then the desired bound is achieved. However the tasks may be interleaved, making the number of segments B much larger than K. We now eliminate the dependence on B, i.e. we solve a key open problem of [ABR07].
We apply the method of specialists to multitask learning, and obtain regret bounds close to the information-theoretic ideal, which in particular do not depend on the task segment count B at all.

Construction We create a subset specialist (S, m) for each basic model index m and subset of tasks S ⊆ {1, . . . , K}. At time t, specialists with the current task κt in their set S are awake, i.e. Wt := {(S, m) | κt ∈ S}, and issue the prediction P(yt|y<t, S, m) := P(yt|y<t, m) of model m. We assign to subset specialist (S, m) the prior probability P(S, m) := P(S) P(m), where P(m) := 1/M is uniform, and P(S) includes each task independently with some fixed bias σ(w):

P(S) := σ(w)^{|S|} σ(s)^{K−|S|}.   (7)

This construction has the property that the product of the prior weights of two loners ({κ1}, m̂) and ({κ2}, m̂) is dramatically lower than that of the single pair specialist ({κ1, κ2}, m̂), especially so when the number of models M is large or when we consider larger task clusters. By strongly favoring it in the prior, any inter-task similarity present will be picked up fast.
The resulting Bayesian strategy involving M 2^K subset specialists can be implemented efficiently.

Theorem 5. The predictions P(yt|y<t) of Bayes with the set prior (7) equal the predictions Predt(yt) of Algorithm 2. They can be computed in O(M) time per outcome using O(KM) storage.

Of particular interest is Algorithm 2's update rule for f^κ_{t+1}(m). This would be a regular Bayesian posterior calculation if vt(m) in Predt(yt) were replaced by f^κ_t(m). In fact, vt(m) is the communication channel by which knowledge about the performance of model m in other tasks is received.

Proof. The resource analysis follows from inspection, noting that the update is fast because only the weights f^κ_t(m) associated to the current task κ are changed. We prove by induction on t that P(m|y<t, Wt) = vt(m). In the base case t = 1 both equal 1/M. For the induction step we expand P(m|y≤t, W_{t+1}), which is by definition proportional to

Σ_{S∋κ_{t+1}} σ(w)^{|S|} σ(s)^{K−|S|} (1/M) ( Π_{q≤t : κq∈S} P(yq|y<q, m) ) ( Π_{q≤t : κq∉S} P(yq|y<q) ).   (8)

The product form of both the set prior and the likelihood allows us to factor this exponential sum of products into a product of binary sums. 
It follows from the induction hypothesis that\n\n(cid:88)\n\nS(cid:51)\u03bat+1\n\n\uf8eb\uf8ed (cid:89)\n\nq\u2264t : \u03baq\u2208S\n\n\uf8f6\uf8f8\uf8eb\uf8ed (cid:89)\n\nq\u2264t : \u03baq /\u2208S\n\n\uf8f6\uf8f8 .\n\nf k\nt (m) =\n\n\u03c3(w)\n\u03c3(s)\n\nP (yq|y<q, m)\nP(yq|y<q)\n\nThen we can divide (8) by P(y\u2264t)\u03c3(s)K and reorganize to\nP(m|y\u2264t, Wt+1) \u221d 1\nM\n\n(cid:89)\n\nf \u03bat+1\nt\n\n(m)\n\nk(cid:54)=\u03bat+1\n\n1\nM\n\nf \u03bat+1\nt\nf \u03bat+1\nt\n\n(m)\n\n(m) + 1\n\n(cid:0)f k\nt (m) + 1(cid:1)\n\nK(cid:89)\n\nk=1\n\n(cid:89)\n(cid:0)f k\nt (m) + 1(cid:1) =\n\nq\u2264t : \u03baq=k\n\n7\n\n\fSince the algorithm maintains \u03c0t(m) =(cid:81)K\n\nk=1(f k\n\nt (m) + 1) this is proportional to vt+1(m).\n\nAlgorithm 2 Bayes with set prior (7) (for online multitask learning)\n\nInput: Number of tasks K \u2265 2, distribution \u03c3(\u00b7) on {w, s}.\nInitialize f k\nfor t = 1, 2, . . . do\n\n\u03c3(s) for each task k and \u03c01(m) :=(cid:81)K\n\n1 (m) := \u03c3(w)\n\nk=1(f k\n\n1 (m) + 1).\n\nt (i) \u03c0t(i)/(f \u03ba\n\nt (m) \u03c0t(m)/(f \u03ba\ni=1 f \u03ba\n\n(cid:80)M\nObserve task index \u03ba = \u03bat.\nCompute auxiliary vt(m) := f \u03ba\nIssue prediction Predt(yt) :=(cid:80)M\nReceive prediction P (yt|y<t, m) of each model m\nObserve outcome yt and suffer loss \u2212 ln Predt(yt).\nUpdate f \u03ba\nt (m) and keep f k\nUpdate \u03c0t+1(m) :=\n\nt+1(m) := P (yt|y<t,m)\n\nPredt(yt) f \u03ba\nf \u03ba\nt+1(m)+1\nt (m)+1 \u03c0t(m).\nf \u03ba\n\nt (m)+1)\n\nt (i)+1) .\n\nm=1 P (yt|y<t, m)vt(m).\n\nend for\n\nt+1(m) := f k\n\nt (m) for all k (cid:54)= \u03ba.\n\nThe Bayesian strategy is hence emulated fast by Algorithm 2. We now show it predicts well.\nTheorem 6. Let \u02c6m1, . . . , \u02c6mK be an N-sparse allocation of M models to K tasks. With tuned\ninclusion rate \u03c3(w) = 1/N, the regret of Bayes (Algorithm 2) is bounded by\n\nR \u2264 N ln ( M /N ) + KN H(1/N ).\n\n\u02c6mk = n}. The sets Sn for n = 1, . . . 
Proof. Without loss of generality assume that $\hat m_k \in \{1, \ldots, N\}$. Let $S_n := \{1 \le k \le K \mid \hat m_k = n\}$. The sets $S_n$ for $n = 1, \ldots, N$ form a partition of the $K$ tasks. By (7), $\prod_{n=1}^N P(S_n) = \sigma(w)^K \sigma(s)^{(N-1)K}$, which is maximized by the proposed tuning. The theorem now follows from (3).

We achieve the desired goal, since $K N H(1/N) \approx K \ln N$. In practice $N$ is of course unavailable for tuning, and we may tune $\sigma(w) = 1/K$ pessimistically to get $K \ln K + N$ instead for all $N$ simultaneously. Alternatively, we may sacrifice some time efficiency and externally mix over all $M$ possible values of $N$ with a decreasing prior, increasing the tuned regret by just $\ln N + O(\ln \ln N)$. If in addition the number of tasks is unknown or unbounded, we may (as done in Section 3.2) decrease the membership rate $\sigma(w)$ with each new task encountered and guarantee regret $R \le N \ln(M/N) + K \ln K + 4N + 2K \ln \ln K$, where now $K$ is the number of tasks actually received.

5 Discussion

We showed that Mixing Past Posteriors is not just a heuristic with an unusual regret bound: we gave it a full Bayesian interpretation using specialist models. We then applied our method to a multitask problem. Again an unusual algorithm resulted, one that exploits sparsity by pulling up the weights of models that have done well before in other tasks. In other words, if all tasks are well predicted by a small subset of base models, then this algorithm improves its prior over models as it learns from previous tasks. Both algorithms closely circumvent NP-hardness. The deep question is whether some of the common updates used in Nature can be brought into the Bayesian fold using the specialist mechanism.

There are a large number of more immediate technical open problems (we just discuss a few). We presented our results using probabilities and log loss.
However, the bounds should easily carry over to the typical pseudo-likelihoods employed in online learning in connection with other loss functions. Next, it would be worthwhile to investigate for which infinite sets of models we can still employ our updates implicitly. It was already shown in [KvE10, Koo11] that MPP can be efficiently emulated on all Bernoulli models. However, what about Gaussians, exponential families in general, or even linear regression? Finally, is there a Bayesian method for modeling concurrent multitasking, i.e. can the Bayesian analysis be generalized to the case where a small subset of models solves many tasks in parallel?

References

[ABR07] Jacob Duncan Abernethy, Peter Bartlett, and Alexander Rakhlin. Multitask learning with expert advice. Technical report, University of California at Berkeley, January 2007.

[ARB08] Alekh Agarwal, Alexander Rakhlin, and Peter Bartlett. Matrix regularization techniques for online multitask learning, October 2008.

[BW02] Olivier Bousquet and Manfred K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3:363–396, 2002.

[CBGLS12] Nicolò Cesa-Bianchi, Pierre Gaillard, Gábor Lugosi, and Gilles Stoltz. A new look at shifting regret. CoRR, abs/1202.3323, 2012.

[CCBG10] Giovanni Cavallanti, Nicolò Cesa-Bianchi, and Claudio Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934, December 2010.

[CKZV10] Alexey Chernov, Yuri Kalnishkan, Fedor Zhdanov, and Vladimir Vovk. Supermartingales in prediction with expert advice. Theoretical Computer Science, 411(29-30):2647–2669, June 2010.

[CV09] Alexey Chernov and Vladimir Vovk. Prediction with expert evaluators' advice. In Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT '09), pages 8–22, Berlin, Heidelberg, 2009. Springer-Verlag.

[FSSW97] Yoav Freund, Robert E. Schapire, Yoram Singer, and Manfred K. Warmuth. Using and combining predictors that specialize. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, pages 334–343. ACM, 1997.

[GWBA02] Robert B. Gramacy, Manfred K. Warmuth, Scott A. Brandt, and Ismail Ari. Adaptive caching by refetching. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, NIPS, pages 1465–1472. MIT Press, 2002.

[HLSS00] David P. Helmbold, Darrell D. E. Long, Tracey L. Sconyers, and Bruce Sherrod. Adaptive disk spin-down for mobile computers. ACM/Baltzer Mobile Networks and Applications (MONET), pages 285–297, 2000.

[HW98] Mark Herbster and Manfred K. Warmuth. Tracking the best expert. Machine Learning, 32:151–178, 1998.

[KdR08] Wouter M. Koolen and Steven de Rooij. Combining expert advice efficiently. In Rocco Servedio and Tong Zhang, editors, Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008), pages 275–286, June 2008.

[Koo11] Wouter M. Koolen. Combining Strategies Efficiently: High-quality Decisions from Conflicting Advice. PhD thesis, Institute of Logic, Language and Computation (ILLC), University of Amsterdam, January 2011.

[KvE10] Wouter M. Koolen and Tim van Erven. Freezing and sleeping: Tracking experts that learn by evolving past posteriors. CoRR, abs/1008.4654, 2010.

[LPS09] Gábor Lugosi, Omiros Papaspiliopoulos, and Gilles Stoltz. Online multi-task learning with hard constraints. In COLT, 2009.

[RAB07] Alexander Rakhlin, Jacob Abernethy, and Peter L. Bartlett. Online discovery of similarity mappings. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 767–774, New York, NY, USA, 2007. ACM.

[SM99] Gil I. Shamir and Neri Merhav. Low complexity sequential lossless coding for piecewise stationary memoryless sources. IEEE Transactions on Information Theory, 45:1498–1519, 1999.

[SRDV11] Avishek Saha, Piyush Rai, Hal Daumé III, and Suresh Venkatasubramanian. Online learning of multiple tasks and their relationships. In AISTATS, Ft. Lauderdale, Florida, 2011.

[VW98] Paul A. J. Volf and Frans M. J. Willems. Switching between two universal source coding algorithms. In Proceedings of the Data Compression Conference, Snowbird, Utah, pages 491–500, 1998.