{"title": "Sequential Transfer in Multi-armed Bandit with Finite Set of Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2220, "page_last": 2228, "abstract": "Learning from prior tasks and transferring that experience to improve future performance is critical for building lifelong learning agents. Although results in supervised and reinforcement learning show that transfer may significantly improve the learning performance, most of the literature on transfer is focused on batch learning tasks. In this paper we study the problem of sequential transfer in online learning, notably in the multi-arm bandit framework, where the objective is to minimize the cumulative regret over a sequence of tasks by incrementally transferring knowledge from prior tasks. We introduce a novel bandit algorithm based on a method-of-moments approach for the estimation of the possible tasks and derive regret bounds for it.", "full_text": "Sequential Transfer in Multi-armed Bandit\n\nwith Finite Set of Models\n\nMohammad Gheshlaghi Azar \u21e4\nSchool of Computer Science\n\nCMU\n\nAlessandro Lazaric \u2020\n\nINRIA Lille - Nord Europe\n\nTeam SequeL\n\nEmma Brunskill \u21e4\n\nSchool of Computer Science\n\nCMU\n\nAbstract\n\nLearning from prior tasks and transferring that experience to improve future per-\nformance is critical for building lifelong learning agents. Although results in su-\npervised and reinforcement learning show that transfer may signi\ufb01cantly improve\nthe learning performance, most of the literature on transfer is focused on batch\nlearning tasks. In this paper we study the problem of sequential transfer in online\nlearning, notably in the multi\u2013armed bandit framework, where the objective is to\nminimize the total regret over a sequence of tasks by transferring knowledge from\nprior tasks. 
We introduce a novel bandit algorithm based on a method-of-moments\napproach for estimating the possible tasks and derive regret bounds for it.\n\n1\n\nIntroduction\n\nLearning from prior tasks and transferring that experience to improve future performance is a key\naspect of intelligence, and is critical for building lifelong learning agents. Recently, multi-task\nand transfer learning received much attention in the supervised and reinforcement learning (RL)\nsetting with both empirical and theoretical encouraging results (see recent surveys by Pan and Yang,\n2010; Lazaric, 2011). Most of these works focused on scenarios where the tasks are batch learning\nproblems, in which a training set is directly provided to the learner. On the other hand, the online\nlearning setting (Cesa-Bianchi and Lugosi, 2006), where the learner is presented with samples in\na sequential fashion, has been rarely considered (see Mann and Choe (2012); Taylor (2009) for\nexamples in RL and Sec. E of Azar et al. (2013) for a discussion on related settings).\nThe multi\u2013armed bandit (MAB) (Robbins, 1952) is a simple yet powerful framework formalizing\nthe online learning with partial feedback problem, which encompasses a large number of applica-\ntions, such as clinical trials, web advertisements and adaptive routing. In this paper we take a step\ntowards understanding and providing formal bounds on transfer in stochastic MABs. We focus on a\nsequential transfer scenario where an (online) learner is acting in a series of tasks drawn from a sta-\ntionary distribution over a \ufb01nite set of MABs. The learning problem, within each task, can be seen\nas a standard MAB problem with a \ufb01xed number of steps. Prior to learning, the model parameters\nof each bandit problem are not known to the learner, nor does it know the distribution probability\nover the bandit problems. Also, we assume that the learner is not provided with the identity of the\ntasks throughout the learning. 
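Concretely, the sequential transfer setting just described can be sketched as a small simulator. All names below are ours and the Bernoulli models are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

# A minimal sketch of the sequential-transfer setting (hypothetical names).
# Theta is a finite set of m bandit models over K arms; at each episode a
# task theta_bar is drawn i.i.d. from a fixed distribution rho over Theta,
# and the learner only ever observes rewards, never the task identity.
rng = np.random.default_rng(0)

Theta = np.array([[0.9, 0.5, 0.2],   # model theta_1: mean reward of each arm
                  [0.3, 0.8, 0.4],   # model theta_2
                  [0.1, 0.2, 0.7]])  # model theta_3
rho = np.array([0.5, 0.3, 0.2])      # stationary task distribution over Theta

def draw_task():
    """Sample a task index ~ rho; the learner is NOT told this index."""
    return int(rng.choice(len(Theta), p=rho))

def pull(task, arm):
    """Bernoulli reward with mean mu_arm(theta_bar), bounded in [0, 1]."""
    return float(rng.random() < Theta[task, arm])
```

A transfer algorithm interacts with `pull` for n steps per episode and must infer which model in `Theta` it is facing from rewards alone.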
To act efficiently in this setting, it is crucial to define a mechanism for transferring knowledge across tasks. In fact, the learner may encounter the same bandit problem over and over throughout the learning, and an efficient algorithm should be able to leverage the knowledge obtained in previous tasks when it is presented with the same problem again. To address this problem, one can transfer the estimates of all the possible models from prior tasks to the current one. Once these models are accurately estimated, we show that an extension of the UCB algorithm (Auer et al., 2002) is able to efficiently exploit this prior knowledge and reduce the regret through tasks (Sec. 3).

⇤ {mazar,ebrun}@cs.cmu.edu
† alessandro.lazaric@inria.fr

The main contributions of this paper are two-fold: (i) we introduce the tUCB algorithm, which transfers the model estimates across tasks and uses this knowledge to achieve a better performance than UCB. We also prove that the new algorithm is guaranteed to perform as well as UCB in early episodes, thus avoiding any negative transfer effect, and then to approach the performance of the ideal case in which the models are all known in advance (Sec. 4.4). (ii) To estimate the models we rely on a recent variant of the method of moments, the robust tensor power (RTP) method (Anandkumar et al., 2013, 2012b), and extend it to the multi-task bandit setting¹: we prove that RTP provides a consistent estimate of the means of all arms (for all models) as long as each arm is pulled at least three times per task, and we derive sample complexity bounds for it (Sec. 4.2). Finally, we report preliminary results on synthetic data confirming the theoretical findings (Sec. 5). An extended version of this paper containing proofs and additional comments is available in (Azar et al., 2013).

2 Preliminaries

We consider a stochastic MAB problem defined by a set of arms A = {1, . . .
, K}, where each arm i ∈ A is characterized by a distribution νi, and the samples (rewards) observed from each arm are independent and identically distributed. We focus on the setting where there exists a set of models Θ = {θ = (ν1, . . . , νK)}, |Θ| = m, which contains all the possible bandit problems. We denote the mean of an arm i, the best arm, and the best value of a model θ ∈ Θ respectively by µi(θ), i*(θ), µ*(θ). We define the arm gap of an arm i for a model θ as Δi(θ) = µ*(θ) − µi(θ), while the model gap for an arm i between two models θ and θ′ is defined as Γi(θ, θ′) = |µi(θ) − µi(θ′)|. We also assume that arm rewards are bounded in [0, 1]. We consider the sequential transfer setting where at each episode j the learner interacts with a task θ̄j, drawn from a distribution ρ over Θ, for n steps. The objective is to minimize the (pseudo-)regret RJ over J episodes, measured as the difference between the rewards obtained by pulling i*(θ̄j) and those achieved by the learner:

RJ = Σ_{j=1}^{J} R^j_n = Σ_{j=1}^{J} Σ_{i ≠ i*} T^j_{i,n} Δi(θ̄j),   (1)

where T^j_{i,n} is the number of pulls to arm i after n steps of episode j. We also introduce some tensor notation. Let X ∈ R^K be a random realization of the rewards of all arms from a random model. All the realizations are i.i.d. conditional on a model θ̄ and E[X | θ = θ̄] = µ(θ̄), where the i-th component of µ(θ) ∈ R^K is [µ(θ)]_i = µi(θ). Given realizations X¹, X², and X³, we define the second moment matrix M2 = E[X¹ ⊗ X²], such that [M2]_{i,j} = E[X¹_i X²_j], and the third moment tensor M3 = E[X¹ ⊗ X² ⊗ X³], such that [M3]_{i,j,l} = E[X¹_i X²_j X³_l]. Since the realizations are conditionally independent, we have that, for every θ ∈ Θ, E[X¹ ⊗ X² | θ] = E[X¹ | θ] ⊗ E[X² | θ] = µ(θ) ⊗ µ(θ), and this allows us to rewrite the second and third moments as M2 = Σ_θ ρ(θ) µ(θ)^⊗2 and M3 = Σ_θ ρ(θ) µ(θ)^⊗3, where v^⊗p = v ⊗ v ⊗ · · · ⊗ v is the p-th tensor power. Let A be a 3rd-order member of the tensor product of the Euclidean space R^K (as M3); then we define the multilinear map as follows. For a set of three matrices {Vi ∈ R^{K×m}}_{1≤i≤3}, the (i1, i2, i3) entry in the 3-way array representation of A(V1, V2, V3) ∈ R^{m×m×m} is [A(V1, V2, V3)]_{i1,i2,i3} := Σ_{1≤j1,j2,j3≤K} A_{j1,j2,j3} [V1]_{j1,i1} [V2]_{j2,i2} [V3]_{j3,i3}. We also use different norms: the Euclidean norm ‖·‖; the Frobenius norm ‖·‖F; the matrix max-norm ‖A‖max = max_{i,j} |[A]_{i,j}|.

3 Multi-arm Bandit with Finite Models

Before considering the transfer problem, we show that a simple variation to UCB allows us to effectively exploit the knowledge of Θ and obtain a significant reduction in the regret. The mUCB (model-UCB) algorithm in Fig. 1 takes as input a set of models Θ including the current (unknown) model θ̄. At each step t, the algorithm computes a subset Θt ⊆ Θ containing only the models whose means µi(θ) are compatible with the current estimates µ̂i,t of the means µi(θ̄) of the current model, obtained by averaging Ti,t pulls, and with their uncertainty εi,t (see Eq. 2 for an explicit definition of this term).

Require: Set of models Θ, number of steps n
for t = 1, . . . , n do
  Build Θt = {θ : ∀i, |µi(θ) − µ̂i,t| ≤ εi,t}
  Select θt = arg max_{θ∈Θt} µ*(θ)
  Pull arm It = i*(θt)
  Observe sample x_{It} and update
end for

Figure 1: The mUCB algorithm.

¹ Notice that estimating the models involves solving a latent variable model estimation problem, for which RTP is the state-of-the-art.

Notice that it is enough that one arm does not satisfy the compatibility condition to discard a model θ. Among all the models in Θt, mUCB first selects the model with the largest optimal value and then pulls its corresponding optimal arm. This choice is coherent with the optimism in the face of uncertainty principle used in UCB-based algorithms, since mUCB always pulls the optimal arm corresponding to the optimistic model compatible with the current estimates µ̂i,t. We show that mUCB incurs a regret which is never worse than UCB's and is often significantly smaller.

We denote the set of arms which are optimal for at least one model in a set Θ′ as A*(Θ′) = {i ∈ A : ∃θ ∈ Θ′ : i*(θ) = i}. The set of models for which the arms in A′ are optimal is Θ(A′) = {θ ∈ Θ : ∃i ∈ A′ : i*(θ) = i}. The set of optimistic models for a given model θ̄ is Θ+ = {θ ∈ Θ : µ*(θ) ≥ µ*(θ̄)}, and their corresponding optimal arms are A+ = A*(Θ+). The following theorem bounds the expected regret (similar bounds hold in high probability). The lemmas and proofs (using standard tools from the bandit literature) are available in Sec. B of Azar et al. (2013).

Theorem 1.
If mUCB is run with δ = 1/n, a set of m models Θ such that θ̄ ∈ Θ, and

εi,t = √( log(m n²/δ) / (2 T_{i,t−1}) ),   (2)

where T_{i,t−1} is the number of pulls to arm i at the beginning of step t, then its expected regret is

E[Rn] ≤ K + Σ_{i∈A+} 2 Δi(θ̄) log(m n³) / min_{θ∈Θ+,i} Γi(θ, θ̄)² ≤ K + Σ_{i∈A+} 2 log(m n³) / min_{θ∈Θ+,i} Γi(θ, θ̄),   (3)

where A+ = A*(Θ+) is the set of arms which are optimal for at least one optimistic model and Θ+,i = {θ ∈ Θ+ : i*(θ) = i} is the set of optimistic models for which i is the optimal arm.

Remark (comparison to UCB). The UCB algorithm incurs a regret

E[Rn(UCB)] ≤ O( Σ_{i∈A} log n / Δi(θ̄) ) ≤ O( K log n / min_i Δi(θ̄) ).

We see that mUCB displays two major improvements. The regret in Eq. 3 can be written as

E[Rn(mUCB)] ≤ O( Σ_{i∈A+} log n / min_{θ∈Θ+,i} Γi(θ, θ̄) ) ≤ O( |A+| log n / min_i min_{θ∈Θ+,i} Γi(θ, θ̄) ).

This result suggests that mUCB tends to discard all the models in Θ+ from the most optimistic down to the actual model θ̄ which, with high probability, is never discarded. As a result, even if other models are still in Θt, the optimal arm of θ̄ is pulled until the end. This significantly reduces the set of arms which are actually pulled by mUCB, and the previous bound only depends on the number of arms in A+, which is |A+| ≤ |A*(Θ)| ≤ K. Furthermore, for all arms i, the minimum gap min_{θ∈Θ+,i} Γi(θ, θ̄) is guaranteed to be larger than the arm gap Δi(θ̄) (see Lem. 4 in Sec. B of Azar et al. (2013)), thus further improving the performance of mUCB w.r.t.
UCB.

4 Online Transfer with Unknown Models

We now consider the case when the set of models is unknown and the regret is cumulated over multiple tasks drawn from ρ (Eq. 1). We introduce tUCB (transfer-UCB), which transfers estimates of Θ, whose accuracy is improved through episodes using a method-of-moments approach.

4.1 The transfer-UCB Bandit Algorithm

Fig. 2 outlines the structure of our online transfer bandit algorithm tUCB (transfer-UCB). The algorithm uses two sub-algorithms: the bandit algorithm umUCB (uncertain model-UCB), whose objective is to minimize the regret at each episode, and RTP (robust tensor power method), which at each episode j computes an estimate {µ̂^j_i(θ)} of the arm means of all the models. The bandit algorithm umUCB in Fig. 3 is an extension of the mUCB algorithm. It first computes a set of models Θ^j_t whose means µ̂^j_i(θ) are compatible with the current estimates µ̂i,t. However, unlike the case where the exact models are available, here the models themselves are estimated and the uncertainty εj in their means (provided as input to umUCB) is taken into account in the definition of Θ^j_t.

Require: number of arms K, number of models m, constant C(Θ)
Initialize estimated models Θ¹ = {µ̂¹_i(θ)}_{i,θ}, samples R ∈ R^{J×K×n}
for j = 1, 2, . . . , J do
  Run Rj = umUCB(Θ^j, n)
  Run Θ^{j+1} = RTP(R, m, K, j, δ)
end for

Figure 2: The tUCB algorithm.

Require: set of models Θ^j, num. steps n
Pull each arm three times
for t = 3K + 1, . . . , n do
  Build Θ^j_t = {θ : ∀i, |µ̂^j_i(θ) − µ̂i,t| ≤ εi,t + εj}
  Compute B^j_t(i; θ) = min{ (µ̂^j_i(θ) + εj), (µ̂i,t + εi,t) }
  Compute θ^j_t = arg max_{θ∈Θ^j_t} max_i B^j_t(i; θ)
  Pull arm It = arg max_i B^j_t(i; θ^j_t)
  Observe sample R(It, T_{It,t}) = x_{It} and update
end for
return Samples R

Figure 3: The umUCB algorithm.

Require: samples R ∈ R^{j×n}, number of models m and arms K, episode j
Estimate the second and third moments M̂2 and M̂3 using the reward samples from R (Eq. 4)
Compute D̂ ∈ R^{m×m} and Û ∈ R^{K×m} (m largest eigenvalues and corresponding eigenvectors of M̂2, resp.)
Compute the whitening mapping Ŵ = Û D̂^{−1/2} and the tensor T̂ = M̂3(Ŵ, Ŵ, Ŵ)
Plug T̂ into Alg. 1 of Anandkumar et al. (2012b) and compute eigenvectors/eigenvalues {v̂(θ)}, {λ̂(θ)}
Compute µ̂^j(θ) = λ̂(θ)(Ŵ^T)⁺ v̂(θ) for all θ ∈ Θ
return Θ^{j+1} = {µ̂^j(θ) : θ ∈ Θ}

Figure 4: The robust tensor power (RTP) method (Anandkumar et al., 2012b).

Once the active set is computed, the algorithm computes an upper-confidence bound on the value of each arm i for each model θ and returns the best arm for the most optimistic model. Unlike in mUCB, due to the uncertainty over the model estimates, a model θ might have more than one optimal arm, and an upper-confidence bound on the mean of the arms, µ̂^j_i(θ) + εj, is used together with the upper-confidence bound µ̂i,t + εi,t, which is directly derived from the samples observed so far from arm i. This guarantees that the B-values are always consistent with the samples generated from the actual model θ̄j. Once umUCB terminates, RTP (Fig. 4) updates the estimates of the model means µ̂^j(θ) = {µ̂^j_i(θ)}_i ∈ R^K using the samples obtained from each arm i.
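As a concrete illustration, the empirical moments that RTP consumes (Eq. 4) could be assembled as follows; this is a sketch with assumed input shapes and names of our own choosing, not the authors' implementation:

```python
import numpy as np

# Sketch of the moment estimates of Eq. 4 (assumed shapes, hypothetical names):
# rewards[l][i] is the list of samples observed from arm i during episode l.
# Each episode's samples are split into three batches; the three batch-mean
# vectors mu1, mu2, mu3 are conditionally independent estimates of mu(theta_l),
# so averaging their tensor products over episodes estimates M2 and M3.
def empirical_moments(rewards):
    K = len(rewards[0])
    J = len(rewards)
    M2 = np.zeros((K, K))
    M3 = np.zeros((K, K, K))
    for episode in rewards:
        # batch b mean of each arm: average of the b-th third of its samples
        mu = [np.array([np.mean(np.array_split(np.asarray(s), 3)[b])
                        for s in episode]) for b in range(3)]
        M2 += np.outer(mu[0], mu[1]) / J                      # mu1 (x) mu2
        M3 += np.einsum('i,j,k->ijk', mu[0], mu[1], mu[2]) / J  # mu1 (x) mu2 (x) mu3
    return M2, M3
```

Whitening `M2` and running a tensor power iteration on the whitened `M3` would then recover the model means up to permutation, as described next.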
At the beginning of each task umUCB pulls all the arms 3 times, since RTP needs at least 3 samples from each arm to accurately estimate the 2nd and 3rd moments (Anandkumar et al., 2012b). More precisely, RTP uses all the reward samples generated up to episode j to estimate the 2nd and 3rd moments (see Sec. 2) as

M̂2 = j⁻¹ Σ_{l=1}^{j} µ1l ⊗ µ2l,  and  M̂3 = j⁻¹ Σ_{l=1}^{j} µ1l ⊗ µ2l ⊗ µ3l,   (4)

where the vectors µ1l, µ2l, µ3l ∈ R^K are obtained by dividing the T^l_{i,n} samples observed from arm i in episode l in three batches and taking their averages (e.g., [µ1l]_i is the average of the first T^l_{i,n}/3 samples).² Since µ1l, µ2l, µ3l are independent estimates of µ(θ̄l), M̂2 and M̂3 are consistent estimates of the second and third moments M2 and M3. RTP relies on the fact that the model means µ(θ) can be recovered from the spectral decomposition of the symmetric tensor T = M3(W, W, W), where W is a whitening matrix for M2, i.e., M2(W, W) = I_{m×m} (see Sec. 2 for the definition of the mapping A(V1, V2, V3)). Anandkumar et al. (2012b) (Thm.
4.3) have shown that, under a mild assumption (see Assumption 1 below), the model means {µ(θ)} can be obtained as µ(θ) = λ(θ)Bv(θ), where (λ(θ), v(θ)) is an eigenvalue/eigenvector pair of the tensor T and B := (W^T)⁺. Thus the RTP algorithm estimates the eigenvectors v̂(θ) and the eigenvalues λ̂(θ) of the m × m × m tensor T̂ := M̂3(Ŵ, Ŵ, Ŵ).³ Once v̂(θ) and λ̂(θ) are computed, the estimated mean vector µ̂^j(θ) is obtained by the inverse transformation µ̂^j(θ) = λ̂(θ)B̂v̂(θ), where B̂ is the pseudo-inverse of Ŵ^T (for a detailed description of the RTP algorithm see Anandkumar et al., 2012b).

² Notice that 1/3([µ1l]_i + [µ2l]_i + [µ3l]_i) = µ̂^l_{i,n}, the empirical mean of arm i at the end of episode l.
³ The matrix Ŵ ∈ R^{K×m} is such that M̂2(Ŵ, Ŵ) = I_{m×m}, i.e., Ŵ is the whitening matrix of M̂2. In general Ŵ is not unique. Here, we choose Ŵ = Û D̂^{−1/2}, where D̂ ∈ R^{m×m} is a diagonal matrix consisting of the m largest eigenvalues of M̂2 and Û ∈ R^{K×m} has the corresponding eigenvectors as its columns.

4.2 Sample Complexity of the Robust Tensor Power Method

umUCB requires as input εj, i.e., the uncertainty of the model estimates. Therefore we need sample complexity bounds on the accuracy of the estimates {µ̂^j_i(θ)} computed by RTP. The performance of RTP is directly affected by the error of the estimates M̂2 and M̂3 w.r.t. the true moments. In Thm. 2 we prove that, as the number of tasks j grows, this error rapidly decreases at the rate √(1/j). This result provides us with an upper bound on the error εj needed for building the confidence intervals in umUCB. The following definition and assumption are required for our result.

Definition 1. Let Σ_{M2} = {λ1, λ2, . . . , λm} be the set of m largest eigenvalues of the matrix M2. Define λmin := min_{λ∈Σ_{M2}} λ, λmax := max_{λ∈Σ_{M2}} λ, and µmax := max_θ ‖µ(θ)‖. Define the minimum gap between the distinct eigenvalues of M2 as Λ := min_{i≠l} |λi − λl|.

Assumption 1. The mean vectors {µ(θ)}_θ are linearly independent and ρ(θ) > 0 for all θ ∈ Θ.

We now state our main result, which is in the form of a high-probability bound on the estimation error of the mean reward vector of every model θ ∈ Θ.

Theorem 2. Pick δ ∈ (0, 1). Let C(Θ) := C3 µmax √(λmax/λ³min) (λmax/Λ + 1/λmin + 1/µmax), where C3 > 0 is a universal constant. Then under Assumption 1 there exist a constant C4 > 0 and a permutation π on Θ such that, for all θ ∈ Θ, we have w.p. 1 − δ

‖µ(θ) − µ̂^j(π(θ))‖ ≤ εj := C(Θ) K^{2.5} m² √( log(K/δ) / (λ²min j) ),   (5)

after j ≥ C4 m⁵ K⁶ log(K/δ) / (λ³min min(λmin, Λ)²) episodes.

Remark (computation of C(Θ)). As illustrated in Fig. 3, umUCB relies on the estimates µ̂^j(θ) and on their accuracy εj. Although the bound reported in Thm. 2 provides an upper confidence bound on the error of the estimates, it contains terms which are not computable in general (e.g., λmin). In practice, C(Θ) should be considered as a parameter of the algorithm. This is not dissimilar from the parameter usually introduced in the definition of εi,t in front of the square-root term in UCB.

4.3 Regret Analysis of umUCB

We now analyze the regret of umUCB when an estimated set of models Θ^j is provided as input. At episode j, for each model θ we define the set of non-dominated arms (i.e., potentially optimal arms) as A^j_*(θ) = {i ∈ A : ∄i′, µ̂^j_i(θ) + εj < µ̂^j_{i′}(θ) − εj}.
Among the non-dominated arms, when the actual model is θ̄j, the set of optimistic arms is A^j_+(θ; θ̄j) = {i ∈ A^j_*(θ) : µ̂^j_i(θ) + εj ≥ µ*(θ̄j)}. As a result, the set of optimistic models is Θ^j_+(θ̄j) = {θ ∈ Θ : A^j_+(θ; θ̄j) ≠ ∅}. In some cases, because of the uncertainty in the model estimates, unlike in mUCB, not all the models θ ≠ θ̄j can be discarded, not even at the end of a very long episode. Among the optimistic models, the set of models that cannot be discarded is defined as Θ̃^j_+(θ̄j) = {θ ∈ Θ^j_+(θ̄j) : ∀i ∈ A^j_+(θ; θ̄j), |µ̂^j_i(θ) − µi(θ̄j)| ≤ εj}. Finally, when we want to apply the previous definitions to a set of models Θ′ instead of a single model we have, e.g., A^j_*(Θ′; θ̄j) = ∪_{θ∈Θ′} A^j_*(θ; θ̄j).

The proofs of the following results are available in Sec. D of Azar et al. (2013); here we only report the number of pulls and the corresponding regret bound.

Corollary 1. If at episode j umUCB is run with εi,t as in Eq. 2 and εj as in Eq. 5 with a parameter δ′ = δ/2K, then any arm i ∈ A, i ≠ i*(θ̄j), is pulled Ti,n times such that

Ti,n ≤ min{ 2 log(2mKn²/δ) / Δi(θ̄j)² , log(2mKn²/δ) / (2 min_{θ∈Θ^j_{i,+}(θ̄j)} Δ̂i(θ; θ̄j)²) + 1 }   if i ∈ A^j_1,
Ti,n ≤ 2 log(2mKn²/δ) / Δi(θ̄j)² + 1   if i ∈ A^j_2,
Ti,n = 0   otherwise,

w.p. 1 − δ, where Θ^j_{i,+}(θ̄j) = {θ ∈ Θ^j_+(θ̄j) : i ∈ A^j_+(θ; θ̄j)} is the set of models for which i is among their optimistic non-dominated arms, Δ̂i(θ; θ̄j) = Γi(θ, θ̄j)/2 − εj, A^j_1 = A^j_+(Θ^j_+(θ̄j); θ̄j) \ A^j_+(Θ̃^j_+(θ̄j); θ̄j) (i.e., the set of arms only proposed by models that can be discarded), and A^j_2 = A^j_+(Θ̃^j_+(θ̄j); θ̄j) (i.e., the set of arms proposed by models that cannot be discarded).

The previous corollary states that arms which cannot be optimal for any optimistic model (i.e., arms outside the optimistic non-dominated arms) are never pulled by umUCB, which focuses only on arms in A^j_+(Θ^j_+(θ̄j); θ̄j). Among these arms, those that may help to remove a model from the active set (i.e., i ∈ A^j_1) are potentially pulled less than by UCB, while the remaining arms, which are optimal for the models that cannot be discarded (i.e., i ∈ A^j_2), are simply pulled according to a UCB strategy. Similar to mUCB, umUCB first pulls the arms that are more optimistic until either the active set Θ^j_t changes or they are no longer optimistic (because of the evidence from the actual samples). We are now ready to derive the per-episode regret of umUCB.

Theorem 3. If umUCB is run for n steps on the set of models Θ^j estimated by RTP after j episodes with δ = 1/n, and the actual model is θ̄j, then its expected regret (w.r.t.
the random realization in episode j and conditional on θ̄j) is

E[R^j_n] ≤ K + Σ_{i∈A^j_1} log(2mKn³) min{ 2/Δi(θ̄j)² , 1/(2 min_{θ∈Θ^j_{i,+}(θ̄j)} Δ̂i(θ; θ̄j)²) } Δi(θ̄j) + Σ_{i∈A^j_2} 2 log(2mKn³) / Δi(θ̄j).

Remark (negative transfer). The transfer of knowledge introduces a bias in the learning process which is often beneficial. Nonetheless, in many cases transfer may result in a bias towards wrong solutions and a worse learning performance, a phenomenon often referred to as negative transfer. The first interesting aspect of the previous theorem is that umUCB is guaranteed to never perform worse than UCB itself. This implies that tUCB never suffers from negative transfer, even when the set Θ^j contains highly uncertain models and might bias umUCB to pull suboptimal arms.

Remark (improvement over UCB). In Sec. 3 we showed that mUCB exploits the knowledge of Θ to focus on a restricted set of arms which are pulled less than by UCB. In umUCB this improvement is not as clear, since the models in Θ are not known but are estimated online through episodes. Yet, similar to mUCB, umUCB has two main sources of potential improvement w.r.t. UCB. As illustrated by the regret bound in Thm. 3, umUCB focuses on the arms in A^j_1 ∪ A^j_2, which is potentially a smaller set than A. Furthermore, the number of pulls to arms in A^j_1 is smaller than for UCB whenever the estimated model gap Δ̂i(θ; θ̄j) is bigger than Δi(θ̄j). Eventually, umUCB reaches the same performance (and improvement over UCB) as mUCB when j is big enough. In fact, the
In fact, the\n+(\u00af\u2713j) \u2318 \u21e5+(\u00af\u2713j)) and all the\nset of optimistic models reduces to the one used in mUCB (i.e., \u21e5j\noptimistic models have only optimal arms (i.e., for any \u2713 2 \u21e5+ the set of non-dominated optimistic\narms is A+(\u2713; \u00af\u2713j) = {i\u21e4(\u2713)}), which corresponds to Aj\n2 \u2318{ i\u21e4(\u00af\u2713j)}, which\nmatches the condition of mUCB. For instance, for any model \u2713, in order to have A\u21e4(\u2713) = {i\u21e4(\u2713)},\nfor any arm i 6= i\u21e4(\u2713) we need that \u02c6\u00b5j\nj \n\ni\u21e4(\u2713)(\u2713) \"j. Thus after\nmini i(\u2713)2 + 1.\n\ni (\u2713) + \"j \uf8ff \u02c6\u00b5j\n2C(\u21e5)\n\n1 \u2318A \u21e4(\u21e5+(\u00af\u2713j)) and Aj\n\nmin\n\nmin\n\u00af\u27132\u21e5\n\n\u27132\u21e5+(\u00af\u2713)\n\nepisodes, all the optimistic models have only one optimal arm independently from the actual identity\nof the model \u00af\u2713j. Although this condition may seem restrictive, in practice umUCB starts improving\nover UCB much earlier, as illustrated in the numerical simulation in Sec. 5.\n\n4.4 Regret Analysis of tUCB\n\nGiven the previous results, we derive the bound on the cumulative regret over J episodes (Eq. 1).\nTheorem 4. If tUCB is run over J episodes of n steps in which the tasks \u00af\u2713j are drawn from a \ufb01xed\ndistribution \u21e2 over a set of models \u21e5, then its cumulative regret is\n\nmin\u21e2 2 log2mKn2/\n\ni(\u00af\u2713j)2\n\n,\n\nj=1 Xi2Aj\n\n1\n\nRJ \uf8ff JK + XJ\n+ XJ\nj=1 Xi2Aj\n\n2\n\n2 log2mKn2/\n\ni(\u00af\u2713j)\n\n,\n\nlog2mKn2/\ni,+(\u00af\u2713j )bj\n\n2 min\u27132\u21e5j\n\ni (\u2713; \u00af\u2713j)2i(\u00af\u2713j)\n\nw.p. 1 w.r.t. 
the randomization over tasks and the realizations of the arms in each episode.

Figure 5: Set of models Θ (mean reward of each arm for models m1–m5).
Figure 6: Complexity over tasks (UCB, UCB+, mUCB, and tUCB vs. number of tasks J).
Figure 7: Regret of UCB, UCB+, mUCB, and tUCB (avg. over episodes) vs. episode length.
Figure 8: Per-episode regret of tUCB (UCB, UCB+, mUCB, and tUCB vs. number of tasks J).

This result immediately follows from Thm. 3 and it shows a linear dependency on the number of episodes J. This dependency is the price to pay for not knowing the identity of the current task θ̄j. If the task identity were revealed at the beginning of each episode, a bandit algorithm could simply cluster all the samples coming from the same task and incur a much smaller cumulative regret, with a logarithmic dependency on episodes and steps, i.e., log(nJ). Nonetheless, as discussed in the previous section, the cumulative regret of tUCB is never worse than that of UCB and, as the number of tasks increases, it approaches the performance of mUCB, which fully exploits the prior knowledge of Θ.

5 Numerical Simulations

In this section we report preliminary results for tUCB on synthetic data. The objective is to illustrate and support the previous theoretical findings.
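For reference, the per-episode pseudo-regret of Eq. 1 measured in these simulations can be computed from the pull counts as in this small sketch (inputs and names are assumed for illustration):

```python
# Pseudo-regret of one episode (Eq. 1): mu[i] is the mean of arm i for the
# current task theta_bar_j, and pulls[i] = T^j_{i,n} is the number of pulls
# to arm i after n steps. Each suboptimal pull costs its arm gap Delta_i.
def episode_regret(mu, pulls):
    best = max(mu)
    return sum((best - mu[i]) * pulls[i]
               for i in range(len(mu)) if mu[i] < best)
```

Summing `episode_regret` over all J episodes gives the cumulative regret RJ of Eq. 1.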
We define a set Θ of m = 5 MAB problems with K = 7 arms each, whose means {µi(θ)}_{i,θ} are reported in Fig. 5 (see Sect. F in Azar et al. (2013) for the actual values), where each model has a different color and squares correspond to optimal arms (e.g., arm 2 is optimal for model θ2). This set of models is chosen to be challenging and to illustrate some interesting cases useful to understand the functioning of the algorithm.⁴ Models θ1 and θ2 only differ in their optimal arms, and this makes it difficult to distinguish them. For arm 3 (which is optimal for model θ3 and thus potentially selected by mUCB), all the models share exactly the same mean value. This implies that no model can be discarded by pulling it. Although this might suggest that mUCB gets stuck pulling arm 3, we showed in Thm. 1 that this is not the case. Models θ1 and θ5 are challenging for UCB since they have a small minimum gap. Only 5 out of the 7 arms are actually optimal for a model in Θ. Thus, we also report the performance of UCB+ which, under the assumption that Θ is known, immediately discards all the arms which are not optimal (i ∉ A*) and performs UCB on the remaining arms. The model distribution is uniform, i.e., ρ(θ) = 1/m.

Before discussing the transfer results, we compare UCB, UCB+, and mUCB, to illustrate the advantage of the prior knowledge of Θ w.r.t. UCB. Fig. 7 reports the per-episode regret of the three algorithms for episodes of different length n (the performance of tUCB is discussed later). The results are averaged over all the models in Θ and over 200 runs each. All the algorithms use the same confidence bound εi,t.

⁴ Notice that although Θ satisfies Assumption 1, the smallest singular value is λmin = 0.0039 and Λ = 0.0038, thus making the estimation of the models difficult.
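For concreteness, the mUCB selection rule (Fig. 1) evaluated in this comparison can be sketched as follows; the variable names and structure are ours, a simplified illustration rather than the evaluated implementation:

```python
import math

# One mUCB decision (a sketch of Fig. 1, hypothetical names). Theta is a list
# of per-model mean vectors; mu_hat and T are the empirical means and pull
# counts for the current task; n is the episode length, delta the confidence.
def mucb_step(Theta, mu_hat, T, n, delta):
    m, K = len(Theta), len(Theta[0])
    # confidence radius eps_{i,t} of Eq. 2 for each arm
    eps = [math.sqrt(math.log(m * n**2 / delta) / (2 * max(T[i], 1)))
           for i in range(K)]
    # keep only models compatible with every arm's confidence interval
    active = [theta for theta in Theta
              if all(abs(theta[i] - mu_hat[i]) <= eps[i] for i in range(K))]
    if not active:          # w.h.p. the true model is never discarded
        active = Theta      # defensive fallback for this sketch
    best = max(active, key=max)      # most optimistic compatible model
    return best.index(max(best))     # pull its optimal arm i*(theta_t)
```

Each call discards the models that are incompatible with any arm's confidence interval and pulls the optimal arm of the most optimistic surviving model, which is why mUCB concentrates its pulls on the small set A+.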
The performance of mUCB is significantly better than that of both UCB and UCB+, thus showing that mUCB makes efficient use of the prior knowledge of Θ. Furthermore, in Fig. 6 the horizontal lines correspond to the value of the regret bounds, up to the n-dependent terms and constants⁵, for the different models in Θ averaged w.r.t. ρ for the three algorithms (the actual values for the different models are in the supplementary material). These values show that the improvement observed in practice is accurately predicted by the upper bounds derived in Thm. 1.

We now move on to analyze the performance of tUCB. In Fig. 8 we show how the per-episode regret changes through episodes for a transfer problem with J = 5000 tasks of length n = 5000. In tUCB we used ε_j as in Eq. 5 with C(Θ) = 2. As discussed in Thm. 3, UCB and mUCB define the boundaries of the performance of tUCB. In fact, at the beginning tUCB selects arms according to a UCB strategy, since no prior information about the models in Θ is available. On the other hand, as more tasks are observed, tUCB is able to transfer the knowledge acquired through episodes and build an increasingly accurate estimate of the models, thus approaching the behavior of mUCB. This is also confirmed by Fig. 6, where we show how the complexity of tUCB changes through episodes. In both cases (regret and complexity) we see that tUCB does not reach the same performance as mUCB. This is due to the fact that some models have relatively small gaps, and thus the number of episodes needed to obtain an estimate of the models accurate enough to match the performance of mUCB is much larger than 5000 (see also the Remarks of Thm. 3). Since the final objective is to achieve a small global regret (Eq. 1), in Fig. 7 we report the cumulative regret averaged over the total number of tasks (J) for different values of J and n.
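The model-elimination mechanism of mUCB discussed above can be sketched as follows (a simplified illustration with noiseless rewards and a hypothetical two-model set, not the actual algorithm or confidence bounds from the paper): at each step, pull the optimal arm of the most optimistic surviving model, then discard any model whose predicted mean for the pulled arm falls outside a confidence interval around the empirical mean.

```python
import numpy as np

def mucb_sketch(models, true_model, n_steps):
    """Simplified mUCB: pull the optimal arm of the most optimistic model
    still considered possible, then discard models inconsistent with the
    empirical mean of the pulled arm. Rewards are noiseless here so the
    elimination dynamics are easy to follow."""
    models = np.asarray(models, dtype=float)      # shape (m, K): means mu_i(theta)
    counts = np.zeros(models.shape[1])
    sums = np.zeros(models.shape[1])
    active = list(range(len(models)))
    for t in range(n_steps):
        best = max(active, key=lambda th: models[th].max())  # optimistic model
        arm = int(np.argmax(models[best]))
        sums[arm] += models[true_model, arm]      # noiseless reward = true mean
        counts[arm] += 1
        width = np.sqrt(np.log(n_steps) / (2 * counts[arm]))
        emp = sums[arm] / counts[arm]
        active = [th for th in active
                  if abs(models[th, arm] - emp) <= width]
    return active

# Two hypothetical models that disagree on both arms: the wrong one is
# eliminated once the confidence width shrinks below the gap on the pulled arm.
surviving = mucb_sketch([[0.9, 0.5], [0.5, 0.8]], true_model=1, n_steps=1000)
```

This also mirrors the discussion of arm 3 above: if all models shared the same mean on the pulled arm, no model would ever be eliminated by pulling it.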
Again, this graph shows that tUCB outperforms UCB and that it tends to approach the performance of mUCB as J increases, for any value of n.

6 Conclusions and Open Questions

In this paper we introduced the transfer problem in the multi-armed bandit framework, where the tasks are drawn from a finite set of bandit problems. We first introduced the bandit algorithm mUCB and showed that it is able to leverage the prior knowledge of the set of bandit problems Θ and reduce the regret w.r.t. UCB. When the set of models is unknown, we defined a method-of-moments variant (RTP) which consistently estimates the means of the models in Θ from the samples collected through episodes. This knowledge is then transferred to umUCB, which performs no worse than UCB and tends to approach the performance of mUCB. For these algorithms we derived regret bounds and reported preliminary numerical simulations. To the best of our knowledge, this is the first work studying the problem of transfer in the multi-armed bandit setting. It opens a series of interesting directions, including whether explicit model identification can improve our transfer regret.

Optimality of tUCB. At each episode, tUCB transfers the knowledge about Θ acquired from previous tasks to achieve a small per-episode regret using umUCB. Although this strategy guarantees that the per-episode regret of tUCB is never worse than that of UCB, it may not be the optimal strategy in terms of the cumulative regret through episodes. In fact, if J is large, it could be preferable to run a model-identification algorithm instead of umUCB in earlier episodes so as to improve the quality of the estimates μ̂_i(θ). Although such an algorithm would incur a much larger regret in earlier tasks (up to linear), it could approach the performance of mUCB in later episodes much faster than tUCB does.
This trade-off between identification of the models and transfer of knowledge may suggest that algorithms different from tUCB are possible.

Unknown model-set size. In some problems the size m of the model set is not known to the learner and needs to be estimated. This problem can be addressed by estimating the rank of the matrix M2, which equals m (Kleibergen and Paap, 2006). We also note that one can relax the assumption that ρ(θ) needs to be positive (see Assumption 1) by using the estimated model size instead of m, since M2 does not depend on the means of models with ρ(θ) = 0.

Acknowledgments. This research was supported by the National Science Foundation (NSF award #SBE-0836012). We would like to thank Sham Kakade and Animashree Anandkumar for valuable discussions. A. Lazaric would like to acknowledge the support of the Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council and FEDER through the 'Contrat de Projets Etat Region (CPER) 2007-2013', and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement 231495 (project CompLACS).

⁵ For instance, for UCB we compute Σ_i 1/Δ_i.

References

Agarwal, A., Dudík, M., Kale, S., Langford, J., and Schapire, R. E. (2012). Contextual bandit learning with predictable rewards. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS'12).

Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S., and Liu, Y.-K. (2012a). A spectral algorithm for latent Dirichlet allocation. In Proceedings of Advances in Neural Information Processing Systems 25 (NIPS'12), pages 926–934.

Anandkumar, A., Ge, R., Hsu, D., and Kakade, S. M. (2013). A tensor spectral approach to learning mixed membership community models. Journal of Machine Learning Research, 1:65.

Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2012b).
Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559.

Anandkumar, A., Hsu, D., and Kakade, S. M. (2012c). A method of moments for mixture models and hidden Markov models. In Proceedings of the 25th Annual Conference on Learning Theory (COLT'12), volume 23, pages 33.1–33.34.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47:235–256.

Azar, M. G., Lazaric, A., and Brunskill, E. (2013). Sequential transfer in multi-armed bandit with finite set of models. CoRR, abs/1307.6887.

Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. (2010). Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901–2934.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.

Dekel, O., Long, P. M., and Singer, Y. (2006). Online multitask learning. In Proceedings of the 19th Annual Conference on Learning Theory (COLT'06), pages 453–467.

Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT'11), pages 174–188, Berlin, Heidelberg. Springer-Verlag.

Kleibergen, F. and Paap, R. (2006). Generalized reduced rank tests using the singular value decomposition. Journal of Econometrics, 133(1):97–126.

Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for multi-armed bandits with side information. In Proceedings of Advances in Neural Information Processing Systems 20 (NIPS'07).

Lazaric, A. (2011). Transfer in reinforcement learning: a framework and a survey. In Wiering, M. and van Otterlo, M., editors, Reinforcement Learning: State of the Art. Springer.

Lugosi, G., Papaspiliopoulos, O., and Stoltz, G. (2009).
Online multi-task learning with hard constraints. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT'09).

Mann, T. A. and Choe, Y. (2012). Directed exploration in reinforcement learning with transferred knowledge. In Proceedings of the Tenth European Workshop on Reinforcement Learning (EWRL'12).

Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the AMS, 58:527–535.

Saha, A., Rai, P., Daumé III, H., and Venkatasubramanian, S. (2011). Online learning of multiple tasks and their relationships. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS'11), Ft. Lauderdale, Florida.

Stewart, G. W. and Sun, J.-G. (1990). Matrix Perturbation Theory. Academic Press.

Taylor, M. E. (2009). Transfer in Reinforcement Learning Domains. Springer-Verlag.

Wedin, P. (1972). Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111.