{"title": "Transfer from Multiple MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1746, "page_last": 1754, "abstract": "Transfer reinforcement learning (RL) methods leverage on the experience collected on a set of source tasks to speed-up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and we introduce novel algorithms adapting the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.", "full_text": "Transfer from Multiple MDPs\n\nAlessandro Lazaric\n\nINRIA Lille - Nord Europe, Team SequeL, France\n\nalessandro.lazaric@inria.fr\n\nDepartment of Electronics and Informatics, Politecnico di Milano, Italy\n\nMarcello Restelli\n\nrestelli@elet.polimi.it\n\nAbstract\n\nTransfer reinforcement learning (RL) methods leverage on the experience col-\nlected on a set of source tasks to speed-up RL algorithms. A simple and effective\napproach is to transfer samples from source tasks and include them in the train-\ning set used to solve a target task. 
In this paper, we investigate the theoretical properties of this transfer method and introduce novel algorithms that adapt the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.

1 Introduction
The objective of transfer in reinforcement learning (RL) [10] is to speed up RL algorithms by reusing knowledge (e.g., samples, value functions, features, parameters) obtained from a set of source tasks. The underlying assumption of transfer methods is that the source tasks (or a suitable combination of them) are somehow similar to the target task, so that the transferred knowledge is useful in learning its solution. A wide range of scenarios and methods for transfer in RL have been studied in the last decade (see [12, 6] for a thorough survey). In this paper, we focus on the simple transfer approach in which trajectory samples are transferred from source MDPs to increase the size of the training set used to solve the target MDP. This approach is particularly suited to problems (e.g., robotics, applications involving human interaction) where it is not possible to interact with the environment long enough to collect all the samples needed to solve the task at hand. If samples are available from other sources (e.g., simulators in the case of robotic applications), the solution of the target task can benefit from a larger training set that also includes some source samples. This approach has already been investigated for transfer between tasks with different state-action spaces in [11], where the source samples are used to build a model of the target task whenever the number of target samples is not large enough. A more sophisticated sample-transfer method is proposed in [5]. 
The authors introduce an algorithm which estimates the similarity between source and target tasks and selectively transfers from the source tasks which are more likely to provide samples similar to those generated by the target MDP. Although the empirical results are encouraging, the proposed method is based on heuristic measures and no theoretical analysis of its performance is provided. On the other hand, in supervised learning a number of theoretical works have investigated the effectiveness of transfer in reducing the sample complexity of the learning process. In domain adaptation, a solution learned on a source task is transferred to a target task, and its performance depends on how similar the two tasks are. In [1] and [8], different distance measures are proposed and shown to be connected to the performance of the transferred solution. The case of transfer of samples from multiple source tasks is studied in [2]. The most interesting finding is that the transfer performance benefits from a larger training set at the cost of an additional error due to the average distance between source and target tasks. This implies the existence of a transfer tradeoff between transferring as many samples as possible and limiting the transfer to sources which are similar to the target task. As a result, the transfer of samples is expected to outperform single-task learning whenever negative transfer (i.e., transfer from source tasks far from the target task) is limited w.r.t. the advantage of increasing the size of the training set. This also opens the question whether it is possible to design methods able to automatically detect the similarity between tasks and adapt the transfer process accordingly. In this paper, we investigate the transfer of samples in RL from a more theoretical perspective than previous works. The main contributions of this paper can be summarized as follows:

• Algorithmic contribution. 
We introduce three sample-transfer algorithms based on fitted Q-iteration [3]. The first algorithm (AST in Sec. 3) simply transfers all the source samples. We also design two adaptive methods (BAT and BTT in Sec. 4 and 5) whose objective is to solve the transfer tradeoff by identifying the best combination of source tasks.

• Theoretical contribution. We formalize the setting of transfer of samples and derive a finite-sample analysis of AST which highlights the importance of the average MDP obtained by the combination of the source tasks. We also report the analysis for BAT, which shows both the advantage of identifying the best combination of source tasks and the additional cost in terms of the auxiliary samples needed to compute the similarity between tasks.

• Empirical contribution. We report results (in Sec. 6) on a simple chain problem which confirm the main theoretical findings and support the idea that sample transfer can significantly speed up the learning process and that adaptive methods are able to solve the transfer tradeoff and avoid negative transfer effects.

The proofs and additional experiments are available in [7].

2 Preliminaries
In this section we introduce the notation and the transfer problem considered in the rest of the paper. We define a discounted Markov decision process (MDP) as a tuple M = ⟨X, A, R, P, γ⟩, where the state space X is a bounded closed subset of a Euclidean space, A is a finite (|A| < ∞) action space, the reward function R : X × A → ℝ is uniformly bounded by R_max, the transition kernel P is such that for all x ∈ X and a ∈ A, P(·|x, a) is a distribution over X, and γ ∈ (0, 1) is a discount factor. 
We denote by S(X × A) the set of probability measures over X × A and by B(X × A; V_max = R_max/(1−γ)) the space of bounded measurable functions with domain X × A and bounded in [−V_max, V_max]. We define the optimal action-value function Q* as the unique fixed point of the optimal Bellman operator T : B(X × A; V_max) → B(X × A; V_max) defined as

(T Q)(x, a) = R(x, a) + γ ∫_X max_{a′∈A} Q(y, a′) P(dy|x, a).

For any measure µ ∈ S(X × A) obtained from the combination of a distribution ρ ∈ S(X) and a uniform distribution over the discrete set A, and a measurable function f : X × A → ℝ, we define the L2(µ)-norm of f as ||f||²_µ = (1/|A|) Σ_{a∈A} ∫_X f(x, a)² ρ(dx). The supremum norm of f is defined as ||f||_∞ = sup_{x∈X} |f(x)|. Finally, we define the standard L2-norm for a vector α ∈ ℝ^d as ||α||² = Σ_{i=1}^d α_i². We denote by φ(·,·) = (ϕ_1(·,·), …, ϕ_d(·,·))^⊤ a feature vector with features ϕ_i : X × A → [−C, C], and by F = {f_α(·,·) = φ(·,·)^⊤α} the linear space of action-value functions spanned by the basis functions in φ. Given a set of state-action pairs {(X_l, A_l)}_{l=1}^L, let Φ = [φ(X_1, A_1)^⊤; …; φ(X_L, A_L)^⊤] be the corresponding feature matrix. We define the orthogonal projection operator Π : B(X × A; V_max) → F as ΠQ = arg min_{f∈F} ||Q − f||_µ. Finally, by T(Q) we denote the truncation of a function Q to the range [−V_max, V_max].
We consider the transfer problem in which M tasks {M_m}_{m=1}^M are available, with M_1 the target task, and the objective is to learn the solution for the target task M_1 by transferring samples from the source tasks {M_m}_{m=2}^M. We define an assumption on how the training sets are generated.
Definition 1. 
(Random Tasks Design) An input set {(X_l, A_l)}_{l=1}^L is built with samples drawn from an arbitrary sampling distribution µ ∈ S(X × A), i.e., (X_l, A_l) ∼ µ. For each task m, one transition and one reward sample are generated in each of the state-action pairs in the input set, i.e., Y_l^m ∼ P_m(·|X_l, A_l) and R_l^m = R_m(X_l, A_l). Finally, we define the random sequence {M_l}_{l=1}^L, where the indexes M_l are drawn i.i.d. from a multinomial distribution with parameters (λ_1, …, λ_M). The training set available to the learner is {(X_l, A_l, Y_l, R_l)}_{l=1}^L, where Y_l = Y_l^{M_l} and R_l = R_l^{M_l}.

This is an assumption on how the samples are generated, but in practice a single realization of samples and task indexes M_l is available. We consider the case in which λ_1 ≪ λ_m (m = 2, …, M). This condition implies that (on average) the number of target samples is much smaller than the number of source

Input: Linear space F = span{ϕ_i, 1 ≤ i ≤ d}, initial function Q̃_0 ∈ F
for k = 1, 2, … do
  Build the training set {(X_l, A_l, Y_l, R_l)}_{l=1}^L [according to the random tasks design]
  Build the feature matrix Φ = [φ(X_1, A_1)^⊤; …; φ(X_L, A_L)^⊤]
  Compute the vector p ∈ ℝ^L with p_l = R_l + γ max_{a′∈A} Q̃_{k−1}(Y_l, a′)
  Compute the projection α̂_k = (Φ^⊤Φ)^{−1}Φ^⊤p and the function Q̂_k = f_{α̂_k}
  Return the truncated function Q̃_k = T(Q̂_k)
end for

Figure 1: A pseudo-code for All-Sample Transfer (AST) Fitted Q-iteration.

samples and is usually not enough to learn an accurate solution for the target task. We will also consider the pure transfer case in which λ_1 = 0 (i.e., no target sample is available). Finally, we notice that Def. 
1 implies the existence of a generative model for all the MDPs, since the state-action pairs are generated according to an arbitrary sampling distribution µ.

3 All-Sample Transfer Algorithm
We first consider the case when the source samples are generated according to Def. 1 and the designer has no access to the source tasks. We study the algorithm called All-Sample Transfer (AST) (Fig. 1), which simply runs FQI with a linear space F on the whole training set {(X_l, A_l, Y_l, R_l)}_{l=1}^L. At each iteration k, given the result of the previous iteration Q̃_{k−1} = T(Q̂_{k−1}), the algorithm returns

Q̂_k = arg min_{f∈F} (1/L) Σ_{l=1}^L ( f(X_l, A_l) − (R_l + γ max_{a′∈A} Q̃_{k−1}(Y_l, a′)) )².   (1)

In the case of linear spaces, this minimization problem is solved in closed form as in Fig. 1. In the following we report a finite-sample analysis of the performance of AST. Similar to [9], we first study the prediction error at each iteration and then propagate it through iterations.

3.1 Single Iteration Finite-Sample Analysis
We define the average MDP M_λ as the average of the M MDPs at hand. We define its reward function R_λ and its transition kernel P_λ as the weighted averages of the reward functions and transition kernels of the basic MDPs, with weights determined by the proportions λ of the multinomial distribution in the definition of the random tasks design (i.e., R_λ = Σ_{m=1}^M λ_m R_m and P_λ = Σ_{m=1}^M λ_m P_m). We also denote by T_λ its optimal Bellman operator. In the random tasks design, the average MDP plays a crucial role, since the implicit target function of the minimization of the empirical loss in Eq. 1 is indeed T_λ Q̃_{k−1}. 
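The AST iteration in Fig. 1 amounts to ordinary least squares on the pooled training set. Below is a minimal, self-contained Python sketch of the random tasks design plus the AST loop on a toy discrete chain; the toy tasks, one-hot features, and all sample sizes are illustrative assumptions, not the paper's setup. With one-hot features, truncating the coefficient vector is the same as truncating the function T(Q̂_k).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): discrete states {0..9}, 2 actions, M = 3 tasks, task 0 = target.
n_states, n_actions, M, gamma = 10, 2, 3, 0.9
V_max = 1.0 / (1.0 - gamma)

def reward(m, x, a):           # hypothetical reward: +1 for taking "right" in the last state
    return 1.0 if (x == n_states - 1 and a == 1) else 0.0

def step(m, x, a):             # hypothetical task-dependent transitions (p varies with m)
    p_right = [0.9, 0.8, 0.7][m]
    move = 1 if a == 1 else -1
    if rng.random() > p_right: # with prob 1 - p the step goes the opposite way
        move = -move
    return int(np.clip(x + move, 0, n_states - 1))

def features(x, a):            # one-hot features over state-action pairs (d = |X||A|)
    phi = np.zeros(n_states * n_actions)
    phi[x * n_actions + a] = 1.0
    return phi

# Random tasks design: state-action pairs from mu, task indexes from a multinomial(lambda).
L, lam = 600, np.array([0.1, 0.45, 0.45])
X = rng.integers(0, n_states, size=L)
A = rng.integers(0, n_actions, size=L)
Ml = rng.choice(M, size=L, p=lam)
R = np.array([reward(m, x, a) for m, x, a in zip(Ml, X, A)])
Y = np.array([step(m, x, a) for m, x, a in zip(Ml, X, A)])
Phi = np.array([features(x, a) for x, a in zip(X, A)])

# AST: fitted Q-iteration on the pooled samples with the closed-form projection of Fig. 1.
alpha = np.zeros(Phi.shape[1])
for k in range(50):
    Q_next = np.array([[features(y, a) @ alpha for a in range(n_actions)] for y in Y])
    p = R + gamma * Q_next.max(axis=1)                 # p_l = R_l + gamma max_a' Q(Y_l, a')
    alpha, *_ = np.linalg.lstsq(Phi, p, rcond=None)    # alpha_k = (Phi^T Phi)^+ Phi^T p
    alpha = np.clip(alpha, -V_max, V_max)              # truncation T(Q_k); valid for one-hot phi

greedy = [int(np.argmax([features(x, a) @ alpha for a in range(n_actions)]))
          for x in range(n_states)]
print(greedy)
```

The greedy policy extracted at the end mostly chooses "right", moving toward the rewarding state; replacing the toy generative models with real ones is all that changes in a larger experiment.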
At each iteration k, we prove the following performance bound for AST.
Theorem 1. Let M be the number of tasks {M_m}_{m=1}^M, with M_1 the target task. Let the training set {(X_l, A_l, Y_l, R_l)}_{l=1}^L be generated as in Def. 1, with a proportion vector λ = (λ_1, …, λ_M). Let f_{α*_k} = ΠT_1 Q̃_{k−1} = arg inf_{f∈F} ||f − T_1 Q̃_{k−1}||_µ; then for any 0 < δ ≤ 1, Q̂_k (Eq. 1) satisfies

||T(Q̂_k) − T_1 Q̃_{k−1}||_µ ≤ 4||f_{α*_k} − T_1 Q̃_{k−1}||_µ + 5√(E_λ(Q̃_{k−1}))
  + 24(V_max + C||α*_k||)√( (2/L) log(27(12Le²)^{2(d+1)}/δ) ) + 32 V_max √( (2/L) log(9/δ) )

with probability 1 − δ (w.r.t. the samples), where ||ϕ_i||_∞ ≤ C and E_λ(Q̃_{k−1}) = ||(T_1 − T_λ)Q̃_{k−1}||²_µ.

Remark 1 (Analysis of the bound). We first notice that the previous bound reduces (up to constants) to the standard bound for FQI when M = 1 [7]. The bound is composed of three main terms: (i) approximation error, (ii) estimation error, and (iii) transfer error. The approximation error ||f_{α*_k} − T_1 Q̃_{k−1}||_µ is the smallest error of functions in F in approximating the target function T_1 Q̃_{k−1}, and it does not depend on the transfer algorithm. The estimation error (the third and fourth terms in the bound) is due to the finite random samples used to learn Q̂_k; it depends on the dimensionality d of the function space and decreases with the total number of samples L at the fast rate of linear spaces (O(d/L) instead of O(√(d/L))). Finally, the transfer error E_λ accounts for the difference between source and target tasks. In fact, samples from source tasks different from the target might bias Q̂_k towards a wrong solution, thus resulting in a poor approximation of the target function T_1 Q̃_{k−1}. It is interesting to notice that the transfer error depends on the difference between the target task and the average MDP M_λ obtained by taking a linear combination of the source tasks weighted by the parameters λ. 
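As a toy check of this last point (all numbers invented for illustration, not from the paper), two sources whose rewards are individually far from the target can average exactly to it, so the reward component of E_λ vanishes for a suitable λ:

```python
import numpy as np

# Hypothetical target reward and two source rewards on a 5-state chain.
R_target = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
R_src = np.array([[0.0, 0.0, 0.0, 0.0, 5.0],    # source A: reward far too large
                  [0.0, 0.0, 0.0, 0.0, -3.0]])  # source B: reward of the wrong sign

def reward_gap(lam):
    """L2 gap between the target reward and the lambda-average of the source rewards."""
    return np.linalg.norm(R_target - lam @ R_src)

# Each source alone is far from the target...
print(reward_gap(np.array([1.0, 0.0])))  # 4.0
print(reward_gap(np.array([0.0, 1.0])))  # 4.0
# ...but lambda = (0.5, 0.5) averages 5 and -3 to exactly 1, matching the target.
print(reward_gap(np.array([0.5, 0.5])))  # 0.0
```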
This means that even when each of the source tasks is very different from the target, if there exists a suitable combination which is similar to the target task, then the transfer process is still likely to be effective. Furthermore, E_λ considers the difference in the result of applying the two Bellman operators to a given function Q̃_{k−1}. As a result, when the two operators T_1 and T_λ have the same reward functions, even if the transition distributions are different (e.g., the total variation ||P_1(·|x, a) − P_λ(·|x, a)||_TV is large), the corresponding averages of Q̃_{k−1} might still be similar (i.e., ∫ max_{a′} Q̃(y, a′) P_1(dy|x, a) similar to ∫ max_{a′} Q̃(y, a′) P_λ(dy|x, a)).

Remark 2 (Comparison to single-task learning). Let Q̂_k^s be the solution obtained by solving one iteration of FQI with only the samples coming from the target task; the performance bounds of Q̂_k and Q̂_k^s can be written as (up to constants and logarithmic factors)

||T(Q̂_k) − T_1 Q̃_{k−1}||_µ ≤ ||f_{α*_k} − T_1 Q̃_{k−1}||_µ + (V_max + C||α*_k||)√(1/L) + V_max √(d/L) + √(E_λ),
||T(Q̂_k^s) − T_1 Q̃_{k−1}||_µ ≤ ||f_{α*_k} − T_1 Q̃_{k−1}||_µ + (V_max + C||α*_k||)√(1/N_1) + V_max √(d/N_1).

The main difference is that Q̂_k^s uses only N_1 samples and, as a result, has a much bigger estimation error than Q̂_k, which takes advantage of all the L samples transferred from the source tasks. At the same time, Q̂_k suffers from an additional transfer error. Thus, we can conclude that AST is expected to perform better than single-task learning whenever the advantage of using more samples is greater than the bias due to samples coming from tasks different from the target task. This introduces a transfer tradeoff between including many source samples, so as to reduce the estimation error, and finding source tasks whose combination leads to a small transfer error. In Sec. 4 we define an adaptive transfer algorithm which selects proportions λ so as to keep the transfer error E_λ as small as possible. Finally, in Sec. 
5 we consider a different setting where the number of samples in each source task is limited, with N_1 = λ_1 L (on average); both bounds above share the same approximation error.

3.2 Propagation Finite-Sample Analysis
We now study how the previous error is propagated through iterations. Let ν be the evaluation norm (i.e., in general different from the sampling distribution µ). We first report two assumptions.¹
Assumption 1. [9] Given µ, ν, p ≥ 1, and an arbitrary sequence of policies {π_p}_{p≥1}, we assume that the future-state distribution µP_1^{π_1} ⋯ P_1^{π_p} is absolutely continuous w.r.t. ν. We assume that c(p) = sup_{π_1⋯π_p} ||d(µP_1^{π_1} ⋯ P_1^{π_p})/dν||_∞ satisfies C_{µ,ν} = (1 − γ)² Σ_p p γ^{p−1} c(p) < ∞.
Assumption 2. Let G ∈ ℝ^{d×d} be the Gram matrix with [G]_ij = ∫ ϕ_i(x, a) ϕ_j(x, a) µ(dx, da). We assume that its smallest eigenvalue ω is strictly positive (i.e., ω > 0).
Theorem 2. Let Assumptions 1 and 2 hold and the setting be as in Thm. 1. After K iterations, AST returns an action-value function Q̃_K whose corresponding greedy policy π_K satisfies

||Q* − Q^{π_K}||_ν ≤ (2γ/(1 − γ)^{3/2}) √(C_{µ,ν}) [ 4 sup_{g∈F} inf_{f∈F} ||f − T_1 g||_µ + 5 sup_α ||(T_1 − T_λ)T(f_α)||_µ
  + 56(V_max + V_max/√ω) √( (2/L) log(27K(12Le²)^{2(d+1)}/δ) ) + 32 V_max √( (2/L) log(9K/δ) ) + 2 V_max γ^K ]

with probability 1 − δ.

Remark (Analysis of the bound). 
The bound reported in the previous theorem displays few differences w.r.t. the single-iteration bound (see [7] for further discussion). The transfer error sup_α ||(T_1 − T_λ)T(f_α)||_µ characterizes the difference between the target and average Bellman operators through the space F. As a result, even MDPs with significantly different rewards and transitions might have a small transfer error because of the functions in F. This introduces a tradeoff in the design of F between a "large" enough space containing functions able to approximate T_1 Q (i.e., small approximation error) and a small function space where the Q-functions induced by T_1 and T_λ can be closer (i.e., small transfer error). This term also displays interesting similarities with the notion of discrepancy introduced in [8] in domain adaptation.

¹We refer to [9] for a thorough explanation of the concentrability terms.

Input: Space F = span{ϕ_i, 1 ≤ i ≤ d}, initial function Q̃_0 ∈ F, number of samples L
Build the auxiliary set {(X_s, A_s, R_{s,1}, …, R_{s,M})}_{s=1}^S and {Y^t_{s,1}, …, Y^t_{s,M}}_{t=1}^T for each s
for k = 1, 2, … do
  Compute λ̂_k = arg min_{λ∈Λ} Ê_λ(Q̃_{k−1})
  Run one iteration of AST (Fig. 1) using L samples generated according to λ̂_k
end for

Figure 2: A pseudo-code for the Best Average Transfer (BAT) algorithm.

4 Best Average Transfer Algorithm
As discussed in the previous section, the transfer error E_λ plays a crucial role in the comparison with single-task learning. In particular, E_λ is related to the proportions λ inducing the average Bellman operator T_λ, which defines the target function approximated at each iteration. 
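One step of BAT (Fig. 2) can be sketched as follows: evaluate the empirical transfer error Ê_λ defined in this section on an auxiliary set and minimize it over the simplex, here by a simple grid search. The stand-in function Q, the synthetic tasks, and all constants below are illustrative assumptions; the second source is built to mimic the target so that the minimizer should put most of its weight there.

```python
import numpy as np

rng = np.random.default_rng(1)

gamma, S, T, M = 0.9, 200, 5, 3          # M tasks; task index 0 is the target
Q = lambda y, a: np.sin(y + a)           # stand-in for the current iterate Q_{k-1}

# Auxiliary set: rewards R[s, m] and next-states Y[s, m, t] for every task.
# Source 2 (index 2) is built to resemble the target (index 0); source 1 is unrelated.
R = rng.normal(size=(S, M))
R[:, 2] = R[:, 0] + 0.05 * rng.normal(size=S)
Y = rng.normal(size=(S, M, T))
actions = np.array([0, 1])

def est_transfer_error(lmbda):
    """Empirical transfer error of Sec. 4; lmbda weights the M-1 sources."""
    # max over actions of Q at each sampled next-state: shape (S, M, T)
    Qmax = np.max([Q(Y, a) for a in actions], axis=0)
    reward_part = R[:, 0] - R[:, 1:] @ lmbda
    next_part = (Qmax[:, 0, :].mean(axis=1)
                 - np.einsum('smt,m->s', Qmax[:, 1:, :], lmbda) / T)
    return np.mean((reward_part + gamma * next_part) ** 2)

# Grid over the simplex of source proportions (M - 1 = 2 sources here).
grid = [np.array([w, 1.0 - w]) for w in np.linspace(0.0, 1.0, 21)]
best = min(grid, key=est_transfer_error)
print(best)  # puts most weight on source 2, the one built to mimic the target
```

Grid search stands in for whatever simplex optimizer one prefers; with more sources, a projected-gradient or convex solver over Λ would replace the grid.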
We now consider the case where the designer has direct access to the source tasks (i.e., it is possible to choose how many samples to draw from each source) and can define an arbitrary proportion λ. In particular, we propose a method that adapts λ at each iteration so as to minimize the transfer error E_λ.
We consider the case in which L is fixed as a parameter of the algorithm and λ_1 = 0 (i.e., no target samples are used in the learning training set). At each iteration k, we need to estimate the quantity E_λ(Q̃_{k−1}). We assume that for each task additional samples are available. Let {(X_s, A_s, R_{s,1}, …, R_{s,M})}_{s=1}^S be an auxiliary training set, where (X_s, A_s) ∼ µ and R_{s,m} = R_m(X_s, A_s). In each state-action pair, we generate T next states for each task, that is, Y^t_{s,m} ∼ P_m(·|X_s, A_s) with t = 1, …, T. Thus, for any function Q we define the estimated transfer error as

Ê_λ(Q) = (1/S) Σ_{s=1}^S [ R_{s,1} − Σ_{m=2}^M λ_m R_{s,m} + (γ/T) Σ_{t=1}^T ( max_{a′} Q(Y^t_{s,1}, a′) − Σ_{m=2}^M λ_m max_{a′} Q(Y^t_{s,m}, a′) ) ]².   (2)

At each iteration, the algorithm Best Average Transfer (BAT) (Fig. 2) first computes λ̂_k = arg min_{λ∈Λ} Ê_λ(Q̃_{k−1}), where Λ is the (M−2)-dimensional simplex, and then runs an iteration of AST with samples generated according to the proportions λ̂_k. We denote by λ*_k = arg min_{λ∈Λ} E_λ(Q̃_{k−1}) the best combination at iteration k.
Theorem 3. Let Q̃_{k−1} be the function returned at the previous iteration and Q̂_k^BAT the function returned by the BAT algorithm (Fig. 2). 
Then for any 0 < δ ≤ 1, Q̂_k^BAT satisfies

||T(Q̂_k^BAT) − T_1 Q̃_{k−1}||_µ ≤ 4||f_{α*_k} − T_1 Q̃_{k−1}||_µ + 5√(E_{λ*_k}(Q̃_{k−1}))
  + 5√2 V_max ( (M − 2) log(8S/δ) / S )^{1/4} + 20 V_max √( log(8SM/δ) / T )
  + 24(V_max + C||α*_k||)√( (2/L) log(54(12Le²)^{2(d+1)}/δ) ) + 32 V_max √( (2/L) log(18/δ) )

with probability 1 − δ.
Remark 1 (Comparison with AST and single-task learning). The bound shows that BAT outperforms AST whenever the advantage of achieving the smallest possible transfer error E_{λ*_k} is larger than the additional estimation error due to the auxiliary training set. When compared to single-task learning, BAT has a better performance whenever the best combination of source tasks has a small transfer error and the additional auxiliary estimation error is smaller than the estimation error in single-task learning. In particular, this means that O((M/S)^{1/4}) + O((1/T)^{1/2}) should be smaller than O((d/N)^{1/2}) (with N the number of target samples). The number of calls to the generative 
The number of calls to the generative\n\n\u2217\n\n5\n\n\fTable 1: Parameters for the \ufb01rst set of tasks\n\nTable 2: Parameters for the second set of tasks\n\ntasks\n\nM1\n\nM2\nM3\nM4\nM5\n\np\n\n0.9\n\n0.9\n0.9\n0.9\n0.9\n\nl\n\n1\n\n2\n1\n1\n1\n\n\u03b7\n\nReward\n\n0.1 +1 in [\u221211, \u22129] \u222a [9, 11]\n\n0.1 \u22125 in [\u221211, \u22129] \u222a [9, 11]\n0.1 +5 in [\u221211, \u22129] \u222a [9, 11]\n+1 in [\u22126, \u22124] \u222a [4, 6]\n0.1\n\u22121 in [\u22126, \u22124] \u222a [4, 6]\n0.1\n\ntasks\n\nM1\n\nM6\nM7\nM8\nM9\n\np\n\n0.9\n\n0.7\n0.1\n0.9\n0.7\n\nl\n\n1\n\n1\n1\n1\n1\n\n\u03b7\n\nReward\n\n0.1 +1 in [\u221211, \u22129] \u222a [9, 11]\n\n0.1 +1 in [\u221211, \u22129] \u222a [9, 11]\n0.1 +1 in [\u221211, \u22129] \u222a [9, 11]\n0.1 \u22125 in [\u221211, \u22129] \u222a [9, 11]\n0.5 +5 in [\u221211, \u22129] \u222a [9, 11]\n\nmodel for BAT is ST . In order to have a fair comparison with single-task learning we set S = N 2/3\nand T = N 1/3, then we obtain the condition M \u2264 d2N \u22124/3 that constrains the number of tasks to\nbe smaller than the dimensionality of F. We remark that the dependency of the auxiliary estimation\nerror on M is due to the fact that the \u03bb vectors (over which the transfer error is optimized) belong\nto the simplex \u039b of dimensionality M -2. Hence, the previous condition suggests that, in general,\nadaptive transfer methods may signi\ufb01cantly improve the transfer performance (i.e., in this case a\nsmaller transfer error) at the cost of additional sources of errors which depend on the dimensionality\nof the search space used to adapt the transfer process (in this case \u039b).\n\n5 Best Transfer Trade-off Algorithm\nThe previous algorithm is proved to successfully estimate the combination of source tasks which\nbetter approximates the Bellman operator of the target task. 
Nonetheless, BAT relies on the implicit assumption that L samples can always be generated from any source task² and it cannot be applied when the number of source samples is limited. Here we consider the more challenging case where the designer still has access to the source tasks but only a limited number of samples is available from each of them. In this case, an adaptive transfer algorithm should solve a tradeoff between selecting as many samples as possible, so as to reduce the estimation error, and choosing the proportion of source samples properly, so as to control the transfer error. The solution of this tradeoff may return non-trivial results, where source tasks similar to the target task but with few samples are discarded in favor of a pool of tasks whose average only roughly approximates the target task but which can provide a larger number of samples.

Here we introduce the Best Tradeoff Transfer (BTT) algorithm. Similar to BAT, it relies on an auxiliary training set to solve the tradeoff. We denote by N_m the maximum number of samples available for source task m. Let β ∈ [0, 1]^M be a weight vector, where β_m is the fraction of samples from task m used in the transfer process. We denote by E_β (Ê_β) the transfer error (the estimated transfer error) with proportions λ, where λ_m = (β_m N_m) / Σ_{m′} (β_{m′} N_{m′}). At each iteration k, BTT returns the vector β which optimizes the tradeoff between estimation and transfer errors, that is

β̂_k = arg min_{β∈[0,1]^M} ( Ê_β(Q̃_{k−1}) + τ √( d / Σ_{m=1}^M β_m N_m ) ),   (3)

where τ is a parameter. While the first term accounts for the transfer error induced by β, the second term is the estimation error due to the total amount of samples used by the algorithm.

Unlike AST and BAT, BTT is a heuristic algorithm motivated by the bound in Thm. 1 and we do not provide any theoretical guarantee for it. 
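The trade-off in Eq. 3 can be sketched directly: score each candidate β by the estimated transfer error plus the τ-weighted estimation term and keep the minimizer. In the sketch below, Ê_β is replaced by a synthetic stand-in (a made-up penalty favoring a hypothetical well-matched source, task 1), and the sample budgets, grid, and constants are all illustrative assumptions.

```python
import numpy as np
from itertools import product

d, tau = 20, 0.75
N = np.array([0, 5000, 5000, 1000])    # max samples per task (index 0: target, unused here)

def est_transfer_error(beta):
    """Stand-in for E_hat_beta(Q): proportions lambda induced by beta, then a synthetic
    penalty that grows as weight moves away from the (hypothetical) good source, task 1."""
    n = beta[1:] * N[1:]
    lam = n / n.sum()
    return 0.5 * (1.0 - lam[0]) ** 2   # pretend source 1 matches the target well

def btt_objective(beta):
    used = np.sum(beta[1:] * N[1:])    # total number of transferred samples
    return est_transfer_error(beta) + tau * np.sqrt(d / used)

# Grid search over beta in [0, 1]^M (target fraction fixed to 0, as in the pure transfer case).
grid = [np.array([0.0, b1, b2, b3])
        for b1, b2, b3 in product(np.linspace(0.1, 1.0, 10), repeat=3)]
best = min(grid, key=btt_objective)
print(best)  # full weight on the well-matched source, near-minimal weight on the others
```

The minimizer takes everything from the cheap-to-match source while keeping only a small fraction of the others for extra samples, which is exactly the non-trivial behavior described above.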
The main technical difficulty is that the setting considered here does not match the random tasks design assumption (see Def. 1), since the number of source samples is constrained by N_m. As a result, given a proportion λ, we cannot assume samples to be drawn at random according to a multinomial with parameters λ. Without this assumption, it is an open question whether a bound similar to those of AST and BAT could be derived.
6 Experiments
In this section, we report preliminary experimental results for the transfer algorithms. The main objective is to illustrate the functioning of the algorithms and compare their results with the theoretical findings. We consider a continuous extension of the chain walk problem proposed in [4]. The state is described by a continuous variable x and two actions are available: one moves toward the left and the other toward the right. With probability p each action makes a step of length l, affected by a noise η,

²If λ_m = 1 for task m, then the algorithm would generate all the L training samples from task m.

Figure 3: Transfer from M2, M3, M4, M5. Left: Comparison between single-task learning, AST with L = 10000, and BAT with L = 1000, 5000, 10000. 
Right: Source task probabilities estimated by the BAT algorithm as a function of FQI iterations.

in the intended direction, while with probability 1−p it moves in the opposite direction. In the target task M1, the state-transition model is defined by the following parameters: p = 0.9, l = 1, and η uniform in the interval [−0.1, 0.1]. The reward function gives +1 when the system state reaches the regions [−11, −9] and [9, 11], and 0 elsewhere. Furthermore, to evaluate the performance of the transfer algorithms previously described, we considered eight source tasks {M2, …, M9} whose state-transition model parameters and reward functions are reported in Tab. 1 and 2. To approximate the Q-functions, we use a linear combination of 20 radial basis functions. In particular, for each action, we consider 9 Gaussians with means uniformly spread in the interval [−20, 20] and variance equal to 16, plus a constant feature. The number of iterations of the FQI algorithm has been empirically fixed to 13. Samples are collected starting from the state x₀ = 0 with actions chosen uniformly at random. All the results are averaged over 100 runs and we report standard-deviation error bars.
We first consider the pure transfer problem, where no target samples are used in the learning training set (i.e., λ₁ = 0). The objective is to study the impact of the transfer error due to the use of source samples and the effectiveness of BAT in finding a suitable combination of source tasks. The left plot in Fig. 3 compares the performance of FQI with and without the transfer of samples from the first four tasks listed in Tab. 1. In the case of single-task learning, the number of target samples refers to the samples used at learning time, while for BAT it represents the size S of the auxiliary training set used to estimate the transfer error. 
Thus, while in single-task learning the performance increases with the number of target samples, in BAT they only make the estimation of E_λ more accurate. The number of source samples added to the auxiliary set for each target sample was empirically fixed to one (T = 1). We first run AST with L = 10000 and λ₂ = λ₃ = λ₄ = λ₅ = 0.25 (which on average corresponds to 2500 samples from each source). As can be noticed by looking at the models in Tab. 1, this combination is very different from the target model and AST does not learn any good policy. On the other hand, even with a small set of auxiliary target samples, BAT is able to learn good policies. This result is due to the existence of linear combinations of source tasks which closely approximate the target task M1 at each iteration of FQI. An example of the proportion coefficients computed at each iteration of BAT is shown in the right plot in Fig. 3. At the first iteration, FQI produces an approximation of the reward function. Given the first four source tasks, BAT finds a combination (λ ≃ (0.2, 0.4, 0.2, 0.2)) that produces the same reward function as R₁. However, after a few FQI iterations, this combination is no longer able to accurately approximate the functions T₁Q̃. In fact, the state-transition model of task M2 is different from all the other ones (the step length is doubled). As a result, the coefficient λ₂ drops to zero, while a new combination of the other source tasks is found. Note that BAT significantly improves over single-task learning, in particular when very few target samples are available.

In the general case, the target task cannot be obtained as a combination of the source tasks, as happens with the second set of source tasks (M6, M7, M8, M9). The impact of this situation on the learning performance of BAT is shown in the left plot in Fig. 4. 
Note that, when a few target samples are available, the transfer of samples from a combination of the source tasks using the BAT algorithm is still beneficial. On the other hand, the performance attainable by BAT is bounded by the transfer error corresponding to the best source-task combination (which in this case is large). As a result, single-task FQI quickly achieves a better performance.

Figure 4: Transfer from M6, M7, M8, M9. Left: Comparison between single-task learning and BAT with L = 1000, 5000, 10000. Right: Comparison between single-task learning, BAT with L = 1000, 10000 in addition to the target samples, and BTT (τ = 0.75) with 5000 and 10000 samples for each source task. To improve readability, the plot is truncated at 5000 target samples.

Results presented so far for the BAT transfer algorithm assume that FQI is trained only with the samples obtained through combinations of source tasks. Since a number of target samples is already available in the auxiliary training set, a trivial improvement is to include them in the training set together with the source samples (selected according to the proportions computed by BAT). As shown in the right plot of Fig.
4, this leads to a significant improvement. From the behavior of BAT it is clear that, with a small set of target samples, it is better to transfer as many samples as possible from the source tasks, while as the number of target samples increases, it is preferable to reduce the number of samples obtained from a combination of source tasks that does not actually match the target task. In fact, for L = 10000, BAT has a much better performance at the beginning, but it is then outperformed by single-task learning. On the other hand, for L = 1000 the initial advantage is small, but the performance remains close to single-task FQI for a large number of target samples. This experiment highlights the tradeoff between the need for samples to reduce the estimation error and the resulting transfer error when the target task cannot be expressed as a combination of source tasks (see Sec. 5). The BTT algorithm provides a principled way to address this tradeoff: as shown by the right plot in Fig. 4, it exploits the advantage of transferring source samples when few target samples are available, and it reduces the weight of the source tasks (so as to avoid large transfer errors) when enough target samples are available. It is interesting to notice that increasing the number of samples available for each source task from 5000 to 10000 improves the performance in the first part of the graph, while leaving the final performance unchanged. This is due to the capability of the BTT algorithm to avoid the transfer of source samples when there is no need for them, thus avoiding negative transfer effects.

7 Conclusions

In this paper, we formalized and studied the sample-transfer problem. We first derived a finite-sample analysis of the performance of a simple transfer algorithm which includes all the source samples in the training set used to solve a given target task.
To the best of our knowledge, this is the first theoretical result for a transfer algorithm in RL showing the potential benefit of transfer over single-task learning. When the designer has direct access to the source tasks, we introduced an adaptive algorithm which selects the proportion of source tasks so as to minimize the bias due to the use of source samples. We then considered a more challenging setting where the number of samples available in each source task is limited and a tradeoff between the amount of transferred samples and the similarity between source and target tasks must be solved. For this setting, we proposed a principled adaptive algorithm. Finally, we reported a detailed experimental analysis on a simple problem which confirms and supports the theoretical findings.

Acknowledgments This work was supported by the French National Research Agency through the project EXPLO-RA n◦ ANR-08-COSI-004, by the Ministry of Higher Education and Research, the Nord-Pas de Calais Regional Council and FEDER through the "contrat de projets état région 2007–2013", and by the PASCAL2 European Network of Excellence. The research leading to these results has also received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n◦ 231495.