{"title": "Lifelong Learning with Non-i.i.d. Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 1540, "page_last": 1548, "abstract": "In this work we aim at extending theoretical foundations of lifelong learning. Previous work analyzing this scenario is based on the assumption that the tasks are sampled i.i.d. from a task environment or limited to strongly constrained data distributions. Instead we study two scenarios when lifelong learning is possible, even though the observed tasks do not form an i.i.d. sample: first, when they are sampled from the same environment, but possibly with dependencies, and second, when the task environment is allowed to change over time. In the first case we prove a PAC-Bayesian theorem, which can be seen as a direct generalization of the analogous previous result for the i.i.d. case. For the second scenario we propose to learn an inductive bias in form of a transfer procedure. We present a generalization bound and show on a toy example how it can be used to identify a beneficial transfer algorithm.", "full_text": "Lifelong Learning with Non-i.i.d. Tasks\n\nAnastasia Pentina\n\nIST Austria\n\nKlosterneuburg, Austria\napentina@ist.ac.at\n\nChristoph H. Lampert\n\nIST Austria\n\nKlosterneuburg, Austria\n\nchl@ist.ac.at\n\nAbstract\n\nIn this work we aim at extending the theoretical foundations of lifelong learning.\nPrevious work analyzing this scenario is based on the assumption that learning\ntasks are sampled i.i.d. from a task environment or limited to strongly constrained\ndata distributions. Instead, we study two scenarios when lifelong learning is pos-\nsible, even though the observed tasks do not form an i.i.d. sample: \ufb01rst, when they\nare sampled from the same environment, but possibly with dependencies, and sec-\nond, when the task environment is allowed to change over time in a consistent\nway. 
In the first case we prove a PAC-Bayesian theorem that can be seen as a direct generalization of the analogous previous result for the i.i.d. case. For the second scenario we propose to learn an inductive bias in the form of a transfer procedure. We present a generalization bound and show on a toy example how it can be used to identify a beneficial transfer algorithm.

1 Introduction

Despite the tremendous growth of available data over the past decade, the lack of fully annotated data, which is an essential part of the success of any traditional supervised learning algorithm, demands methods that allow good generalization from limited amounts of training data. One way to approach this is provided by the lifelong learning (or learning to learn [1]) paradigm, which is based on the idea of accumulating knowledge over the course of learning multiple tasks in order to improve the performance on future tasks.

In order for this scenario to make sense one has to define what kind of relations connect the observed tasks with the future ones. The first formal model of lifelong learning was proposed by Baxter in [2]. He introduced the notion of a task environment – a set of all tasks that may need to be solved, together with a probability distribution over them. In Baxter's model the lifelong learning system observes tasks that are sampled i.i.d. from the task environment. This allows proving bounds in the PAC framework [3, 4] that guarantee that a hypothesis set or inductive bias that works well on the observed tasks will also work well on future tasks from the same environment. Baxter's results were later extended using algorithmic stability [5], task similarity measures [6], and PAC-Bayesian analysis [7].
Specific cases that were studied include feature learning [8] and sparse coding [9]. All these works, however, assume that the observed tasks are independently and identically distributed, as the original work by Baxter did. This assumption allows making predictions about the future of the learning process, but it limits the applicability of the results in practice. To our knowledge, only the recent [10] has studied lifelong learning without an i.i.d. assumption. However, the considered framework is limited to binary classification with linearly separable classes and isotropic log-concave data distributions.

In this work we use the PAC-Bayesian framework to study two possible relaxations of the i.i.d. assumption without restricting the class of possible data distributions. First, we study the case in which tasks can have dependencies between them, but are still sampled from a fixed task environment. An illustrative example would be when the tasks are to predict the outcomes of chess games. Whenever a player plays multiple games, the corresponding tasks are not independent. In this setting we retain many concepts of [7] and learn an inductive bias in the form of a probability distribution. We prove a bound relating the expected error when relying on the learned bias for future tasks to its empirical error over the observed tasks. It has the same form as for the i.i.d. situation, except for a slowdown of convergence proportional to a parameter capturing the amount of dependence between tasks.

Second, we introduce a new and more flexible lifelong learning setting, in which the learner observes a sequence of tasks from different task environments. This could be, e.g., classification tasks of increasing difficulty. In this setting one cannot expect that transferring an inductive bias from observed tasks to future tasks will be beneficial, because the task environment is not stationary. Instead, we aim at learning an effective transfer algorithm: a procedure that solves a task taking information from a previous task into account. We bound the expected performance of such algorithms when applied to future tasks based on their performance on the observed tasks.

2 Preliminaries

Following Baxter's model [2] we assume that all tasks that may need to be solved share the same input space X and output space Y. The lifelong learning system observes n tasks t_1, ..., t_n in the form of training sets S_1, ..., S_n, where each S_i = {(x^i_1, y^i_1), ..., (x^i_m, y^i_m)} is a set of m points sampled i.i.d. from the corresponding unknown data distribution D_i over X × Y. In contrast to previous works on lifelong learning [2, 5, 8] we omit the assumption that the observed tasks are independently and identically distributed.

In order to theoretically analyze lifelong learning in the case of non-i.i.d. tasks we use techniques from PAC-Bayesian theory [11]. We assume that the learner uses the same hypothesis set H = {h : X → Y} and the same loss function ℓ : Y × Y → [0, 1] for solving all tasks. PAC-Bayesian theory studies the performance of randomized, Gibbs, predictors. Formally, for any probability distribution Q over the hypothesis set, the corresponding Gibbs predictor for every point x ∈ X randomly samples h ∼ Q and returns h(x). The expected loss of such a Gibbs predictor on a task corresponding to a data distribution D is given by:

er(Q) = E_{h∼Q} E_{(x,y)∼D} ℓ(h(x), y),   (1)

and its empirical counterpart based on a training set S sampled from D^m is given by:

êr(Q) = E_{h∼Q} (1/m) Σ_{i=1}^m ℓ(h(x_i), y_i).   (2)

PAC-Bayesian theory allows us to obtain upper bounds on the difference between these two quantities of the following form:

Theorem 1. Let P be any distribution over H, fixed before observing the sample S.
Then for any δ > 0 the following holds uniformly for all distributions Q over H with probability at least 1 − δ:

er(Q) ≤ êr(Q) + (1/√m) KL(Q‖P) + (1 + 8 log(1/δ)) / (8√m),   (3)

where KL denotes the Kullback-Leibler divergence.

The distribution P should be chosen before observing any data and is therefore usually referred to as the prior distribution. In contrast, the bound holds uniformly with respect to the distributions Q. Whenever the bound consists only of computable quantities, it can be used to choose a data-dependent Q that minimizes the right hand side of inequality (3) and thus provides a Gibbs predictor whose expected error is bounded by a hopefully low value. Such a Q is usually referred to as a posterior distribution. Note that besides explicit bounds, such as (3), in the case of 0/1-loss one can also derive implicit bounds that can be tighter in some regimes [12]. Instead of the error difference, er − êr, these bound their KL-divergence, kl(êr‖er), where kl(q‖p) denotes the KL-divergence between two Bernoulli random variables with success probabilities q and p. In this work, we prefer explicit bounds as they are more intuitive and allow for more freedom in the choice of different loss functions. They also allow us to combine several inequalities in an additive way, which we make use of in Sections 3 and 4.

3 Dependent tasks

The first extension of Baxter's model that we study is the case when the observed tasks are sampled from the same task environment, but with some interdependencies. In other words, in this case the tasks are identically, but not independently, distributed. Since the task environment is assumed to be constant, we can build on ideas from the situation of i.i.d. tasks in [7].
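To make the single-task bound of Theorem 1 concrete, its right hand side can be evaluated numerically. Below is a minimal sketch for one-dimensional Gaussian prior and posterior; the function names and all numbers are hypothetical, chosen only to illustrate how the empirical error, the KL term, and the confidence term of (3) combine:

```python
import math

def kl_gaussians(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for scalar Gaussians."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2) - 0.5)

def pac_bayes_bound(emp_err, kl, m, delta):
    """Right hand side of the explicit bound (3):
    emp_err + KL(Q||P)/sqrt(m) + (1 + 8*log(1/delta)) / (8*sqrt(m))."""
    return (emp_err + kl / math.sqrt(m)
            + (1 + 8 * math.log(1 / delta)) / (8 * math.sqrt(m)))

# hypothetical numbers: posterior mean 0.8, unit variance, m = 1000 samples
kl = kl_gaussians(mu_q=0.8, sigma_q=1.0)
print(pac_bayes_bound(emp_err=0.1, kl=kl, m=1000, delta=0.05))
```

As the bound predicts, increasing m shrinks both the KL term and the confidence term at rate 1/√m, while a posterior far from the prior inflates the KL term.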
We assume that for all tasks the learner uses the same deterministic learning algorithm that produces a posterior distribution Q based on a prior distribution P and a sample set S. We also assume that there is a set of possible prior distributions and some hyper-prior distribution P over it. The goal of the learner is to find a hyper-posterior distribution Q over this set such that, when the prior is sampled according to Q, the expected loss on the next, yet unobserved task is minimized:

er(Q) = E_{P∼Q} E_{(t,S_t)} E_{h∼Q(P,S_t)} E_{(x,y)∼D_t} ℓ(h(x), y).   (4)

The empirical counterpart of the above quantity is given by:

êr(Q) = E_{P∼Q} (1/n) Σ_{i=1}^n E_{h∼Q_i(P,S_i)} (1/m) Σ_{j=1}^m ℓ(h(x^i_j), y^i_j).   (5)

In order to bound the difference between these two quantities we adopt the two-staged procedure used in [7]. First, we bound the difference between the empirical error êr(Q) and the corresponding expected multi-task risk given by:

ẽr(Q) = E_{P∼Q} (1/n) Σ_{i=1}^n E_{h∼Q_i(P,S_i)} E_{(x,y)∼D_i} ℓ(h(x), y).   (6)

Then we continue with bounding the difference between er(Q) and ẽr(Q).

Since, conditioned on the observed tasks, the corresponding training samples are independent, we can reuse the following result from [7] in order to perform the first step of the proof.

Theorem 2. With probability at least 1 − δ uniformly for all Q:

ẽr(Q) ≤ êr(Q) + (1/(n√m)) ( KL(Q‖P) + Σ_{i=1}^n E_{P∼Q} KL(Q_i(P, S_i)‖P) ) + (n + 8 log(1/δ)) / (8n√m).   (7)

To bound the difference between er(Q) and ẽr(Q), however, the results from [7] cannot be used, because they rely on the assumption that the observed tasks are independent. Instead we adopt ideas from chromatic PAC-Bayesian bounds [13] that rely on the properties of a dependency graph built with respect to the dependencies within the observed tasks.

Definition 1 (Dependency graph). The dependency graph Γ(t) = (V, E) of a set of random variables t = (t_1, ..., t_n) is such that:
• the set of vertices V equals {1, ..., n},
• there is no edge between i and j if and only if t_i and t_j are independent.

Definition 2 (Exact fractional cover [14]). Let Γ = (V, E) be an undirected graph with V = {1, ..., n}. A set C = {(C_j, w_j)}_{j=1}^k, where C_j ⊂ V and w_j ∈ [0, 1] for all j, is a proper exact fractional cover if:
• for every j all vertices in C_j are independent,
• ∪_j C_j = V,
• for every i ∈ V: Σ_{j=1}^k w_j I_{i∈C_j} = 1.
The sum of the weights w(C) = Σ_{j=1}^k w_j is the chromatic weight of C, and k is the size of C.

Then the following holds:

Theorem 3. For any fixed hyper-prior distribution P, any proper exact fractional cover C of the dependency graph Γ(t_1, ..., t_n) of size k and any δ > 0, the following holds with probability at least 1 − δ uniformly for all hyper-posterior distributions Q:

er(Q) ≤ ẽr(Q) + √(w(C)/n) KL(Q‖P) + √(w(C)) (1 − 8 log δ + 8 log k) / (8√n).   (8)

Proof. By Donsker-Varadhan's variational formula [15]:

er(Q) − ẽr(Q) = Σ_{j=1}^k (w_j/n) Σ_{i∈C_j} E_{P∼Q} ( E_{(t,S_t)} er_t(Q_t) − er_i(Q_i) )
≤ Σ_{j=1}^k (w_j/(w(C)λ_j)) ( KL(Q‖P) + log E_{P∼P} exp( (λ_j w(C)/n) Σ_{i∈C_j} ( E_{(t,S_t)} er_t(Q_t) − er_i(Q_i) ) ) ).   (9)

Since the tasks within every C_j are independent, for every fixed prior P the quantities { E_{(t,S_t)} er_t(Q_t) − er_i(Q_i) }_{i∈C_j} are i.i.d. and take values in [b − 1, b], where b = E_{(t,S_t)} er_t(Q_t). Therefore, by Hoeffding's lemma [16]:

E_{(t_i,S_i), i∈C_j} exp( (λ_j w(C)/n) Σ_{i∈C_j} ( E_{(t,S_t)} er_t(Q_t) − er_i(Q_i) ) ) ≤ exp( λ_j² w(C)² |C_j| / (8n²) ).   (10)

Therefore, by Markov's inequality, with probability at least 1 − δ_j it holds that:

log E_{P∼P} exp( (λ_j w(C)/n) Σ_{i∈C_j} ( E_{(t,S_t)} er_t(Q_t) − er_i(Q_i) ) ) ≤ λ_j² w(C)² |C_j| / (8n²) − log δ_j.   (11)

Consequently, we obtain with probability at least 1 − Σ_{j=1}^k δ_j:

er(Q) − ẽr(Q) ≤ Σ_{j=1}^k (w_j/(w(C)λ_j)) KL(Q‖P) + Σ_{j=1}^k w_j λ_j w(C) |C_j| / (8n²) − Σ_{j=1}^k (w_j/(w(C)λ_j)) log δ_j.   (12)

By setting λ_1 = ··· = λ_k = √(n/w(C)) and δ_j = δ/k we obtain the statement of the theorem.

By combining Theorems 2 and 3 we obtain the main result of this section:

Theorem 4. For any fixed hyper-prior distribution P, any proper exact fractional cover C of the dependency graph Γ(t_1, ..., t_n) of size k and any δ > 0, the following holds with probability at least 1 − δ uniformly for all hyper-posterior distributions Q:

er(Q) ≤ êr(Q) + ((1 + √(w(C)mn)) / (n√m)) KL(Q‖P) + (1/(n√m)) Σ_{i=1}^n E_{P∼Q} KL(Q_i(P, S_i)‖P) + (n + 8 log(2/δ)) / (8n√m) + √(w(C)) (1 + 8 log(2/δ) + 8 log k) / (8√n).   (13)

Theorem 4 shows that even in the case of non-independent tasks a bound very similar to that in [7] can be obtained. In particular, it contains two types of complexity terms: KL(Q‖P) corresponds to the level of the task environment, and KL(Q_i‖P) corresponds specifically to the i-th task. Similarly to the i.i.d. case, when the learner has access to an unlimited amount of data for finitely many observed tasks (m → ∞, n < ∞), the complexity terms of the second type converge to 0 as 1/√m, while the first one does not, as there is still uncertainty on the task environment level. In the opposite situation, when the learner has access to infinitely many tasks, but with only finitely many samples per task (m < ∞, n → ∞), the first complexity term converges to 0 as √(w(C)/n), up to logarithmic terms. Thus there is a worsening compared to the i.i.d. case proportional to √(w(C)), which represents the amount of dependence among the tasks. If the tasks are actually i.i.d., the dependency graph contains no edges, so we can form a cover of size 1 with chromatic weight 1. Thus we recover the result from [7] as a special case of Theorem 4.

For a general dependency graph, fastest convergence is obtained by using a cover with minimal chromatic weight. It is known that the minimal chromatic weight, χ*(Γ), satisfies the following inequality [14]: 1 ≤ c(Γ) ≤ χ*(Γ) ≤ Δ(Γ) + 1, where c(Γ) is the order of the largest clique in Γ and Δ(Γ) is the maximum degree of a vertex in Γ.

In some situations, even the bound obtainable from Theorem 4 by plugging in a cover with the minimal chromatic weight can be improved: Theorem 4 also holds for any subset t_s, |t_s| = s, of the observed tasks with the induced dependency subgraph Γ_s. Therefore it might provide a tighter bound if χ*(Γ_s)/s is smaller than χ*(Γ)/n. However, this is not guaranteed, since the empirical error êr computed on t_s might become larger, as well as the second part of the bound, which decreases with n and does not depend on the chromatic weight of the cover.
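As an illustration of how such a cover can be obtained in practice, any proper coloring of the dependency graph yields a proper exact fractional cover: each color class gets weight 1, and a greedy coloring uses at most Δ(Γ) + 1 colors, matching the inequality above. The sketch below is hypothetical (the graph and the function name are not from the paper):

```python
def greedy_cover(n, edges):
    """Greedy proper coloring of a dependency graph on vertices 0..n-1.
    Each color class C_j contains pairwise independent tasks, so
    {(C_j, w_j = 1)} is a proper exact fractional cover with chromatic
    weight w(C) = number of colors <= max degree + 1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in range(n):  # assign each vertex the smallest color unused by neighbors
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(n) if c not in used)
    k = max(color.values()) + 1
    return [([v for v in range(n) if color[v] == j], 1.0) for j in range(k)]

# chain of dependencies t0 - t1 - t2 - t3 (e.g., consecutive games of one player)
cover = greedy_cover(4, [(0, 1), (1, 2), (2, 3)])
print(cover)  # two classes (even and odd tasks), chromatic weight w(C) = 2
```

For the chain, the cover splits the tasks into even- and odd-indexed classes, so the slowdown factor in Theorem 4 is √2 rather than √n.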
Note also that such a subset needs to be chosen before observing the data, since the bound of Theorem 4 holds with probability 1 − δ only for a fixed set of tasks and a fixed cover.

Another important aspect of Theorem 4 as a PAC-Bayesian bound is that the right hand side of inequality (13) consists only of computable quantities. Therefore it can be seen as a quality measure of a hyper-posterior Q, and by minimizing it one can obtain a distribution that is adjusted to a particular task environment. The resulting minimizer can be expected to work well even on new, yet unobserved tasks, because the guarantees of Theorem 4 still hold due to the uniformity of the bound. To do so, one can use the same techniques as in [7], because Theorem 4 differs from the bound provided there only by constant factors.

4 Changing Task Environments

In this section we study a situation in which the task environment gradually changes: each next task t_{i+1} is sampled from a distribution T_{i+1} over the tasks that can depend on the history of the process. Due to the change of the task environment, the previous idea of learning one prior for all tasks no longer seems reasonable. Instead, we propose to learn a transfer algorithm that produces a solution for the current task based on the corresponding sample set and the sample set from the previous task. Formally, we assume that there is a set A of learning algorithms that produce a posterior distribution Q_{i+1} for task t_{i+1} based on the training samples S_i and S_{i+1}.
The goal of the learner is to identify an algorithm A in this set that leads to good performance when applied to a new, yet unobserved task, while using the last observed training sample S_n.¹

¹ Note that this setup includes the possibility of model selection, such as predictors using different feature representations or (hyper)parameter values.

For each task t_i and each algorithm A ∈ A we define the expected and empirical error of applying this algorithm as follows:

er_i(A) = E_{h∼Q_i} E_{(x,y)∼D_i} ℓ(h(x), y),   êr_i(A) = E_{h∼Q_i} (1/m) Σ_{j=1}^m ℓ(h(x^i_j), y^i_j),   (14)

where Q_i = A(S_i, S_{i−1}). The goal of the learner is to find A that minimizes er_{n+1} given the history of the observed tasks. However, if the task environment were to change arbitrarily from step to step, the observed tasks would not contain any relevant information for a new task. To overcome this difficulty, we make the assumption that the expected performance of the algorithms in A does not change over time. Formally, we assume that for each A ∈ A there exists a value, er(A), such that for every i = 2, ..., n + 1, with E_i = (T_i, t_i, S_i):

E_{{E_{i−1}, E_i}}[ er_i(A) | E_1, ..., E_{i−2} ] = er(A).   (15)

In words, the quality of a transfer algorithm does not depend on when during the task sequence it is applied, provided that it is always applied to the subsequent sample sets. Note that this is a natural assumption for lifelong learning: without it, the quality of transfer algorithms could change over time, so an algorithm that works well for all observed tasks might not work anymore for future tasks.

The goal of the learner can be reformulated as identifying A ∈ A with minimal er(A), which can be seen as the expected value of the expected risk of applying algorithm A on the next, yet unobserved task. Since er(A) is unknown, we derive an upper bound based on the observed data that holds uniformly for all algorithms A and therefore can be used to guide the learner. To do so, we again use the construction with hyper-priors and hyper-posteriors from the previous section. Formally, let P be a prior distribution over the set of possible algorithms that is fixed before any data arrives, and let Q be a possibly data-dependent hyper-posterior. The quality of the hyper-posterior and its empirical counterpart are given by the following quantities:

er(Q) = E_{A∼Q} er(A),   êr(Q) = E_{A∼Q} (1/(n−1)) Σ_{i=2}^n êr_i(A).   (16)

Similarly to the previous section, we first bound the difference between êr(Q) and the multi-task expected error given by:

ẽr(Q) = E_{A∼Q} (1/(n−1)) Σ_{i=2}^n er_i(A).   (17)

Even though Theorem 2 is not directly applicable here, a more careful modification of it allows us to obtain the following result (see the supplementary material for a detailed proof):

Theorem 5. For any fixed hyper-prior distribution P, with probability at least 1 − δ the following holds uniformly for all hyper-posterior distributions Q:

ẽr(Q) ≤ êr(Q) + (1/((n−1)√m)) KL(Q × Q_2 × ··· × Q_n ‖ P × P_2 × ··· × P_n) + ((n−1) + 8 log(1/δ)) / (8(n−1)√m),

where P_2, ..., P_n are some reference prior distributions that do not depend on the training sets of subsequent tasks. Possible choices include using just one prior distribution P fixed before observing any data, or using the posterior distributions obtained from the previous task, i.e. P_i = Q_{i−1}.

To complete the proof we need to bound the difference between er(Q) and ẽr(Q).
We use techniques from [17] in combination with those from [13], resulting in the following lemma:

Lemma 1. For any fixed algorithm A and any λ the following holds:

E_{E_1,...,E_n} exp( λ ( er(A) − (1/(n−1)) Σ_{i=2}^n er_i(A) ) ) ≤ exp( λ² / (2(n−1)) ).   (18)

Proof. First, define X_i = (E_{i−1}, E_i) for i = 2, ..., n, g : X_i ↦ er_i(A), and b = er(A). Then, by convexity of the exponential:

exp( λ ( er(A) − (1/(n−1)) Σ_{i=2}^n er_i(A) ) ) = exp( (λ/(n−1)) ( Σ_{even i} (b − g(X_i)) + Σ_{odd i} (b − g(X_i)) ) )
≤ (1/2) exp( (2λ/(n−1)) Σ_{even i} (b − g(X_i)) ) + (1/2) exp( (2λ/(n−1)) Σ_{odd i} (b − g(X_i)) ).   (19)

Note that both the set of X_i-s corresponding to even i and the set of X_i-s corresponding to odd i form a martingale difference sequence. Therefore, by using Lemma 2 from the supplementary material (or, similarly, Lemma 2 in [17]) and Hoeffding's lemma [16], we obtain:

E_{E_1,...,E_n} exp( (2λ/(n−1)) Σ_{even i} (b − g(X_i)) ) ≤ exp( 4λ² / (8(n−1)) ),   (20)

and the same for the odd i. Together with inequality (19) this gives the statement of the lemma.

Now we can prove the following statement:

Theorem 6. For any hyper-prior distribution P and any δ > 0, with probability at least 1 − δ the following inequality holds uniformly for all Q:

er(Q) ≤ ẽr(Q) + (1/√(n−1)) KL(Q‖P) + (1 + 2 log(1/δ)) / (2√(n−1)).   (21)

Proof. By applying Donsker-Varadhan's variational formula [15] one obtains that:

er(Q) − ẽr(Q) ≤ (1/λ) ( KL(Q‖P) + log E_{A∼P} exp λ ( er(A) − (1/(n−1)) Σ_{i=2}^n er_i(A) ) ).   (22)

For a fixed algorithm A we obtain from Lemma 1:

E_{E_1,...,E_n} exp( λ ( er(A) − (1/(n−1)) Σ_{i=2}^n er_i(A) ) ) ≤ exp( λ² / (2(n−1)) ).   (23)

Since P does not depend on the process, by Markov's inequality, with probability at least 1 − δ, we obtain:

E_{A∼P} exp λ ( er(A) − (1/(n−1)) Σ_{i=2}^n er_i(A) ) ≤ (1/δ) exp( λ² / (2(n−1)) ).   (24)

The statement of the theorem follows by setting λ = √(n−1).

By combining Theorems 5 and 6 we obtain the main result of this section:

Theorem 7. For any hyper-prior distribution P and any δ > 0, with probability at least 1 − δ the following holds uniformly for all Q:

er(Q) ≤ êr(Q) + ((√((n−1)m) + 1) / ((n−1)√m)) KL(Q‖P) + (1/((n−1)√m)) Σ_{i=2}^n E_{A∼Q} KL(Q_i‖P_i) + ((n−1) + 8 log(2/δ)) / (8(n−1)√m) + (1 + 2 log(2/δ)) / (2√(n−1)),   (25)

where P_2, ..., P_n are some reference prior distributions that should not depend on the data of subsequent tasks.

Similarly to Theorem 4, the above bound contains two types of complexity terms: one corresponding to the level of the changes in the task environment, and task-specific terms. The first complexity term converges to 0 like 1/√(n−1) when the number of observed tasks increases, indicating that more observed tasks allow for a better estimation of the behavior of the transfer algorithms. The task-specific complexity terms vanish only when the amount of observed data m per task grows. In addition, since the right hand side of inequality (25) consists only of computable quantities and at the same time holds uniformly for all Q, one can obtain a posterior distribution over the transfer algorithms that is adjusted to the particular changing task environment by minimizing it.

Figure 1: Illustration of three learning tasks sampled from a non-stationary environment. Shaded areas illustrate the data distribution, + and − indicate positive and negative training examples. Between subsequent tasks, the data distribution changes by a rotation. A transfer algorithm with access to two subsequent tasks can compensate for this by rotating the previous data into the new position, thereby obtaining more data samples to train on.

We illustrate this process by discussing a toy example (Figure 1). Suppose that X = R², Y = {−1, 1}, and that the learner uses linear classifiers, h(x) = sign⟨w, x⟩, and 0/1-loss, ℓ(y_1, y_2) = [y_1 ≠ y_2], for solving every task. For simplicity we assume that every task environment contains only one task or, equivalently, every T_i is a delta peak, and that the change in the environment between two steps is due to a constant rotation by θ_0 = π/6 of the feature space. For the set A we use a one-parameter family of transfer algorithms, A_α for α ∈ R. Given sample sets S_prev and S_cur, any algorithm A_α first rotates S_prev by the angle α, and then trains a linear support vector machine on the union of both sets. Clearly, the quality of each transfer algorithm depends on the chosen angle, and an elementary calculation shows that condition (15) is fulfilled. We can therefore use the bound (25) as a criterion to determine a beneficial angle.²
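To make the toy setup explicit, the transfer step of an algorithm A_α can be sketched as follows. This is a hypothetical minimal implementation: for self-containedness it fits the linear classifier by least squares instead of training a support vector machine, and the environment, sample sizes, and seed are illustrative only:

```python
import numpy as np

def rotate(X, alpha):
    """Rotate 2-d points by the angle alpha."""
    R = np.array([[np.cos(alpha), -np.sin(alpha)],
                  [np.sin(alpha),  np.cos(alpha)]])
    return X @ R.T

def transfer_fit(S_prev, S_cur, alpha):
    """Transfer algorithm A_alpha: rotate the previous sample by alpha,
    then fit a linear classifier on the union of both sets
    (least squares here as a stand-in for the linear SVM)."""
    (Xp, yp), (Xc, yc) = S_prev, S_cur
    X = np.vstack([rotate(Xp, alpha), Xc])
    y = np.concatenate([yp, yc])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# hypothetical non-stationary environment: task i is the halfspace
# sign(<w_true, x>) with w_true rotated by i * pi/6
rng = np.random.default_rng(0)
def sample_task(i, m=10):
    X = rng.normal(size=(m, 2))
    w_true = rotate(np.array([[1.0, 0.0]]), i * np.pi / 6)[0]
    return X, np.sign(X @ w_true)

S1, S2 = sample_task(1), sample_task(2)
w = transfer_fit(S1, S2, alpha=np.pi / 6)  # alpha matching the shift theta_0
Xt, yt = sample_task(2, m=1000)            # fresh data from the current task
print(np.mean(np.sign(Xt @ w) != yt))      # test error of the transferred fit
```

With α = θ_0 the rotated previous sample is label-consistent with the current task, so the classifier is effectively trained on twice as much data, which is exactly the benefit the bound (25) is used to detect.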
For that we set Q_i = N(w_i, I_2), i.e. unit-variance Gaussian distributions with means w_i. Similarly, we choose all reference prior distributions as unit-variance Gaussians with zero mean, P_i = N(0, I_2). Analogously, we set the hyper-prior P to be N(0, 10), a zero-mean normal distribution with enlarged variance, in order to make all reasonable rotations α lie within one standard deviation from the mean. As hyper-posteriors Q we choose N(θ, 1), and the goal of the learning is to identify the best θ. In order to obtain the objective function from equation (25) we first compute the complexity terms (and approximate all expectations with respect to Q by the values at its mean θ):

KL(Q‖P) = θ²/20,   E_{A∼Q} KL(Q_i‖P_i) ≈ ‖w_i‖²/2.

The empirical error of the Gibbs classifiers in the case of 0/1-loss and Gaussian distributions is given by the following expression (we again approximate the expectation by the value at θ) [20, 21]:

êr(Q) ≈ (1/(n−1)) Σ_{i=2}^n (1/m) Σ_{j=1}^m Φ( y^i_j ⟨w_i, x^i_j⟩ / ‖x^i_j‖ ),   (26)

where Φ(z) = (1/2)(1 − erf(z/√2)) and erf(z) is the Gauss error function. The resulting objective function that we obtain for identifying a beneficial angle θ is the following:

J(θ) = ((√((n−1)m) + 1) / ((n−1)√m)) · θ²/20 + (1/(n−1)) Σ_{i=2}^n ( ‖w_i‖²/(2√m) + (1/m) Σ_{j=1}^m Φ( y^i_j ⟨w_i, x^i_j⟩ / ‖x^i_j‖ ) ).   (27)

Numeric experiments confirm that by optimizing J(θ) with respect to θ one can obtain an advantageous angle: using n = 2, ..., 11 tasks, each with m = 10 samples, we obtain an average test error of 14.2% for the (n+1)-th task. As can be expected, this lies in between the error for the same setting without transfer, which was 15.0%, and the error when always rotating by π/6, which was 13.5%.

5 Conclusion

In this work we present a PAC-Bayesian analysis of lifelong learning under two types of relaxations of the i.i.d. assumption on the tasks. Our results show that accumulating knowledge over the course of learning multiple tasks can be beneficial for the future even if these tasks are not i.i.d. In particular, for the situation when the observed tasks are sampled from the same task environment but with possible dependencies, we prove a theorem that generalizes the existing bound for the i.i.d. case. As a second setting, we further relax the i.i.d. assumption and allow the task environment to change over time. Our bound shows that it is possible to estimate the performance of applying a transfer algorithm on future tasks based on its performance on the observed ones. Furthermore, our result can be used to identify a beneficial algorithm based on the given data, and we illustrate this process on a toy example. For future work, we plan to expand on this aspect. Essentially, any existing domain adaptation algorithm can be used as a transfer method in our setting. However, the success of domain adaptation techniques is often caused by an asymmetry between the source and the target: such algorithms usually rely on the availability of extensive amounts of data from the source and only limited amounts from the target. In contrast, in the lifelong learning setting all tasks are assumed to be equipped with limited training data. Therefore we are particularly interested in identifying how far the constant quality assumption can be carried over to existing domain adaptation techniques and real-world lifelong learning situations.

Acknowledgments.
This work was in part funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 308036.

²Note that Theorem 7 provides an upper bound for the expected error of stochastic Gibbs classifiers, and not deterministic ones that are preferable in practice. However, for 0/1-loss the error of a Gibbs classifier is bounded from below by half the error of the corresponding majority vote predictor [18, 19] and therefore twice the right hand side of (25) provides a bound for deterministic classifiers.

References

[1] Sebastian Thrun and Tom M. Mitchell. Lifelong robot learning. Technical report, Robotics and Autonomous Systems, 1993.

[2] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

[3] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[4] Vladimir N. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer, 1982.

[5] Andreas Maurer. Algorithmic stability and meta-learning. Journal of Machine Learning Research (JMLR), 6:967–994, 2005.

[6] Gilles Blanchard, Gyemin Lee, and Clayton Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Conference on Neural Information Processing Systems (NIPS), 2011.

[7] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning (ICML), 2014.

[8] Andreas Maurer. Transfer bounds for linear feature learning. Machine Learning, 75:327–350, 2009.

[9] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning (ICML), 2013.

[10] Maria-Florina Balcan, Avrim Blum, and Santosh Vempala.
Efficient representations for lifelong learning and autoencoding. In Workshop on Computational Learning Theory (COLT), 2015.

[11] David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.

[12] Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research (JMLR), 3:233–269, 2003.

[13] Liva Ralaivola, Marie Szafranski, and Guillaume Stempfel. Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research (JMLR), 2010.

[14] Daniel Ullman and Edward Scheinerman. Fractional Graph Theory: A Rational Approach to the Theory of Graphs. Wiley Interscience Series in Discrete Mathematics, 1997.

[15] Monroe D. Donsker and S. R. Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. I. Communications on Pure and Applied Mathematics, 28:1–47, 1975.

[16] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.

[17] Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58:7086–7093, 2012.

[18] David A. McAllester. Simplified PAC-Bayesian margin bounds. In Workshop on Computational Learning Theory (COLT), 2003.

[19] François Laviolette and Mario Marchand. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. Journal of Machine Learning Research (JMLR), 8:1461–1487, 2007.

[20] John Langford and John Shawe-Taylor. PAC-Bayes and margins. In Conference on Neural Information Processing Systems (NIPS), 2002.

[21] Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand.
PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), 2009.