{"title": "Multi-Task Learning for Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 4848, "page_last": 4856, "abstract": "Contextual bandits are a form of multi-armed bandit in which the agent has access to predictive side information (known as the context) for each arm at each time step, and have been used to model personalized news recommendation, ad placement, and other applications. In this work, we propose a multi-task learning framework for contextual bandit problems. Like multi-task learning in the batch setting, the goal is to leverage similarities in contexts for different arms so as to improve the agent's ability to predict rewards from contexts. We propose an upper confidence bound-based multi-task learning algorithm for contextual bandits, establish a corresponding regret bound, and interpret this bound to quantify the advantages of learning in the presence of high task (arm) similarity. We also describe an effective scheme for estimating task similarity from data, and demonstrate our algorithm's performance on several data sets.", "full_text": "Multi-Task Learning for Contextual Bandits\n\nAniket Anand Deshmukh\n\nDepartment of EECS\n\nUniversity of Michigan Ann Arbor\n\nAnn Arbor, MI 48105\naniketde@umich.edu\n\nUrun Dogan\n\nMicrosoft Research\n\nCambridge CB1 2FB, UK\nurun.dogan@skype.net\n\nClayton Scott\n\nDepartment of EECS\n\nUniversity of Michigan Ann Arbor\n\nAnn Arbor, MI 48105\nclayscot@umich.edu\n\nAbstract\n\nContextual bandits are a form of multi-armed bandit in which the agent has access\nto predictive side information (known as the context) for each arm at each time step,\nand have been used to model personalized news recommendation, ad placement,\nand other applications. In this work, we propose a multi-task learning framework\nfor contextual bandit problems. 
Like multi-task learning in the batch setting, the\ngoal is to leverage similarities in contexts for different arms so as to improve the\nagent\u2019s ability to predict rewards from contexts. We propose an upper con\ufb01dence\nbound-based multi-task learning algorithm for contextual bandits, establish a cor-\nresponding regret bound, and interpret this bound to quantify the advantages of\nlearning in the presence of high task (arm) similarity. We also describe an effective\nscheme for estimating task similarity from data, and demonstrate our algorithm\u2019s\nperformance on several data sets.\n\n1\n\nIntroduction\n\nA multi-armed bandit (MAB) problem is a sequential decision making problem where, at each time\nstep, an agent chooses one of several \u201carms,\" and observes some reward for the choice it made. The\nreward for each arm is random according to a \ufb01xed distribution, and the agent\u2019s goal is to maximize\nits cumulative reward [4] through a combination of exploring different arms and exploiting those\narms that have yielded high rewards in the past [15, 11].\nThe contextual bandit problem is an extension of the MAB problem where there is some side\ninformation, called the context, associated to each arm [12]. Each context determines the distribution\nof rewards for the associated arm. The goal in contextual bandits is still to maximize the cumulative\nreward, but now leveraging the contexts to predict the expected reward of each arm. Contextual\nbandits have been employed to model various applications like news article recommendation [7],\ncomputational advertisement [9], website optimization [20] and clinical trials [19]. For example, in\nthe case of news article recommendation, the agent must select a news article to recommend to a\nparticular user. The arms are articles and contextual features are features derived from the article and\nthe user. 
The reward is based on whether a user reads the recommended article.
One common approach to contextual bandits is to fix the class of policy functions (i.e., functions from contexts to arms) and try to learn the best function with time [13, 18, 16]. Most algorithms estimate rewards either separately for each arm, or have one single estimator that is applied to all arms.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In contrast, our approach is to adopt the perspective of multi-task learning (MTL). The intuition is that some arms may be similar to each other, in which case it should be possible to pool the historical data for these arms to estimate the mapping from context to rewards more rapidly. For example, in the case of news article recommendation, there may be thousands of articles, and some of those are bound to be similar to each other.

Problem 1 Contextual Bandits
for t = 1, ..., T do
    Observe context $x_{a,t} \in \mathbb{R}^d$ for all arms $a \in [N]$, where $[N] = \{1, \ldots, N\}$
    Choose an arm $a_t \in [N]$
    Receive a reward $r_{a_t,t} \in \mathbb{R}$
    Improve arm selection strategy based on new observation $(x_{a_t,t}, a_t, r_{a_t,t})$
end for

The contextual bandit problem is formally stated in Problem 1. The total $T$ trial reward is defined as $\sum_{t=1}^{T} r_{a_t,t}$ and the optimal $T$ trial reward as $\sum_{t=1}^{T} r_{a^*_t,t}$, where $r_{a_t,t}$ is the reward of the selected arm $a_t$ at time $t$ and $a^*_t$ is the arm with maximum reward at trial $t$. The goal is to find an algorithm that minimizes the $T$ trial regret
$$R(T) = \sum_{t=1}^{T} r_{a^*_t,t} - \sum_{t=1}^{T} r_{a_t,t}.$$
We focus on upper confidence bound (UCB) type algorithms for the remainder of the paper. A UCB strategy is a simple way to represent the exploration and exploitation tradeoff. For each arm, there is an upper bound on reward, comprised of two terms. The first term is a point estimate of the reward, and the second term reflects the confidence in the reward estimate. The strategy is to select the arm with maximum UCB. The second term dominates when the agent is not confident about its reward estimates, which promotes exploration. On the other hand, when all the confidence terms are small, the algorithm exploits the best arm(s) [2].
In the popular UCB type contextual bandits algorithm called Lin-UCB, the expected reward of an arm is modeled as a linear function of the context, $\mathbb{E}[r_{a,t} \mid x_{a,t}] = x_{a,t}^T \theta^*_a$, where $r_{a,t}$ is the reward of arm $a$ at time $t$ and $x_{a,t}$ is the context of arm $a$ at time $t$. To select the best arm, one estimates $\theta_a$ for each arm independently using the data for that particular arm [13]. In the language of multi-task learning, each arm is a task, and Lin-UCB learns each task independently.
In the theoretical analysis of Lin-UCB [7] and its kernelized version Kernel-UCB [18], $\theta_a$ is replaced by $\theta$, and the goal is to learn one single estimator using data from all the arms. In other words, the data from the different arms are pooled together and viewed as coming from a single task. These two approaches, independent and pooled learning, are two extremes, and reality often lies somewhere in between. In the MTL approach, we seek to pool some tasks together, while learning others independently.
We present an algorithm motivated by this idea and call it kernelized multi-task learning UCB (KMTL-UCB). Our main contributions are proposing a UCB type multi-task learning algorithm for contextual bandits, establishing a regret bound and interpreting it to reveal the impact of increased task similarity, introducing a technique for estimating task similarities on the fly, and demonstrating the effectiveness of our algorithm on several datasets.
This paper is organized as follows. 
Section 2 describes related work, and in Section 3 we propose a UCB algorithm using multi-task learning. Regret analysis is presented in Section 4, and our experimental findings are reported in Section 5. We conclude in Section 6.

2 Related Work

A UCB strategy is a common approach to quantify the exploration/exploitation tradeoff. At each time step $t$, and for each arm $a$, a UCB strategy estimates a reward $\hat{r}_{a,t}$ and a one-sided confidence interval above $\hat{r}_{a,t}$ with width $\hat{w}_{a,t}$. The term $ucb_{a,t} = \hat{r}_{a,t} + \hat{w}_{a,t}$ is called the UCB index or just UCB. Then at each time step $t$, the algorithm chooses the arm $a$ with the highest UCB.
In contextual bandits, the idea is to view learning the mapping $x \mapsto r$ as a regression problem. Lin-UCB uses a linear regression model while Kernel-UCB uses a nonlinear regression model drawn from the reproducing kernel Hilbert space (RKHS) of a symmetric and positive definite (SPD) kernel. Either of these two regression models could be applied in either the independent setting or the pooled setting. In the independent setting, the regression function for each arm is estimated separately. This was the approach adopted by Li et al. [13] with a linear model. Regret analysis for both Lin-UCB and Kernel-UCB adopted the pooled setting [7, 18]. Kernel-UCB in the independent setting has not previously been considered to our knowledge, although the algorithm would just be a kernelized version of Li et al. [13]. We will propose a methodology that extends the above four combinations of setting (independent and pooled) and regression model (linear and nonlinear). Gaussian Process UCB (GP-UCB) uses a Gaussian prior on the regression function and is a Bayesian equivalent of Kernel-UCB [16].
There are some contextual bandit setups that incorporate multi-task learning. 
In Lin-UCB with Hybrid Linear Models the estimated reward consists of two linear terms, one that is arm-specific and another that is common to all arms [13]. Gang of bandits [5] uses a graph structure (e.g., a social network) to transfer the learning from one user to another for personalized recommendation. Collaborative filtering bandits [14] is a similar technique which clusters the users based on context. Contextual Gaussian Process UCB (CGP-UCB) builds on GP-UCB and has many elements in common with our framework [10]. We defer a more detailed comparison to CGP-UCB until later.

3 KMTL-UCB

We propose an alternate regression model that includes the independent and pooled settings as special cases. Our approach is inspired by work on transfer and multi-task learning in the batch setting [3, 8]. Intuitively, if two arms (tasks) are similar, we can pool the data for those arms to train better predictors for both.
Formally, we consider regression functions of the form
$$f : \tilde{\mathcal{X}} \mapsto \mathcal{Y}$$
where $\tilde{\mathcal{X}} = \mathcal{Z} \times \mathcal{X}$, and $\mathcal{Z}$ is what we call the task similarity space, $\mathcal{X}$ is the context space and $\mathcal{Y} \subseteq \mathbb{R}$ is the reward space. Every context $x_a \in \mathcal{X}$ is associated with an arm descriptor $z_a \in \mathcal{Z}$, and we define $\tilde{x}_a = (z_a, x_a)$ to be the augmented context. Intuitively, $z_a$ is a variable that can be used to determine the similarity between different arms. Examples of $\mathcal{Z}$ and $z_a$ will be given below.
Let $\tilde{k}$ be a SPD kernel on $\tilde{\mathcal{X}}$. In this work we focus on kernels of the form
$$\tilde{k}\big((z, x), (z', x')\big) = k_Z(z, z')\, k_X(x, x'), \qquad (1)$$
where $k_X$ is a SPD kernel on $\mathcal{X}$, such as a linear or Gaussian kernel if $\mathcal{X} = \mathbb{R}^d$, and $k_Z$ is a kernel on $\mathcal{Z}$ (examples given below). Let $\mathcal{H}_{\tilde{k}}$ be the RKHS of functions $f : \tilde{\mathcal{X}} \mapsto \mathbb{R}$ associated to $\tilde{k}$. 
Note that a product kernel is just one option for $\tilde{k}$, and other forms may be worth exploring.

3.1 Upper Confidence Bound

Instead of learning regression estimates for each arm separately, we effectively learn regression estimates for all arms at once by using all the available training data. Let $N$ be the total number of distinct arms that the algorithm has to choose from. Define $[N] = \{1, \ldots, N\}$ and let the observed contexts at time $t$ be $x_{a,t}, \forall a \in [N]$. Let $n_{a,t}$ be the number of times the algorithm has selected arm $a$ up to and including time $t$, so that $\sum_{a=1}^{N} n_{a,t} = t$. Define sets $t_a = \{\tau < t : a_\tau = a\}$, where $a_\tau$ is the arm selected at time $\tau$. Notice that $|t_a| = n_{a,t-1}$ for all $a$. We solve the following problem at time $t$:
$$\hat{f}_t = \arg\min_{f \in \mathcal{H}_{\tilde{k}}} \frac{1}{N} \sum_{a=1}^{N} \frac{1}{n_{a,t-1}} \sum_{\tau \in t_a} (f(\tilde{x}_{a,\tau}) - r_{a,\tau})^2 + \lambda \|f\|^2_{\mathcal{H}_{\tilde{k}}}, \qquad (2)$$
where $\tilde{x}_{a,\tau}$ is the augmented context of arm $a$ at time $\tau$, and $r_{a,\tau}$ is the reward of an arm $a$ selected at time $\tau$. This problem (2) is a variant of kernel ridge regression. 
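As a concrete illustration, the weighted objective above can be solved in closed form. The following minimal sketch, with hypothetical synthetic data, a Gaussian kernel on (flattened) augmented contexts, and illustrative per-arm pull counts, computes the weighted kernel ridge estimate:

```python
import numpy as np

# Hypothetical toy data: 20 past observations with 2-d contexts and noisy rewards.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                   # stand-in for augmented contexts
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=20)   # observed rewards
n_counts = rng.integers(1, 5, size=20)         # n_{a_tau, t-1}: pull count of each row's arm

def gauss_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

lam = 0.1
K = gauss_kernel(X, X)
eta = np.diag(1.0 / n_counts)                  # diagonal weighting matrix eta_{t-1}
# Weighted kernel ridge estimate: f_hat(x) = k(x)^T (eta K + lam I)^{-1} eta y
coef = np.linalg.solve(eta @ K + lam * np.eye(len(y)), eta @ y)

def f_hat(x_new):
    return gauss_kernel(np.atleast_2d(x_new), X) @ coef

print(float(f_hat(X[0])[0]))  # in-sample reward prediction at the first context
```

The weighting by $1/n_{a,t-1}$ down-weights arms that have already been pulled many times, so no single arm dominates the fit.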
Applying the representer theorem [17], the optimal $f$ can be expressed as $f = \sum_{a'=1}^{N} \sum_{\tau' \in t_{a'}} \alpha_{a',\tau'} \tilde{k}(\cdot, \tilde{x}_{a',\tau'})$, which yields the solution (detailed derivation is in the supplementary material)
$$\hat{f}_t(\tilde{x}) = \tilde{k}_{t-1}(\tilde{x})^T (\eta_{t-1} \tilde{K}_{t-1} + \lambda I)^{-1} \eta_{t-1} y_{t-1}, \qquad (3)$$
where $\tilde{K}_{t-1}$ is the $(t-1) \times (t-1)$ kernel matrix on the augmented data $[\tilde{x}_{a_\tau,\tau}]_{\tau=1}^{t-1}$, $\tilde{k}_{t-1}(\tilde{x}) = [\tilde{k}(\tilde{x}, \tilde{x}_{a_\tau,\tau})]_{\tau=1}^{t-1}$ is a vector of kernel evaluations between $\tilde{x}$ and the past data, $y_{t-1} = [r_{a_\tau,\tau}]_{\tau=1}^{t-1}$ is the vector of all observed rewards, and $\eta_{t-1}$ is the $(t-1) \times (t-1)$ diagonal matrix $\eta_{t-1} = \mathrm{diag}[\frac{1}{n_{a_\tau,t-1}}]_{\tau=1}^{t-1}$.
When $\tilde{x} = \tilde{x}_{a,t}$, we write $\tilde{k}_{a,t} = \tilde{k}_{t-1}(\tilde{x}_{a,t})$. With only minor modifications to the argument in Valko et al. [18], we have the following:
Lemma 1. Suppose the rewards $[r_{a_\tau,\tau}]_{\tau=1}^{T}$ are independent random variables with means $\mathbb{E}[r_{a_\tau,\tau} \mid \tilde{x}_{a_\tau,\tau}] = f^*(\tilde{x}_{a_\tau,\tau})$, where $f^* \in \mathcal{H}_{\tilde{k}}$ and $\|f^*\|_{\mathcal{H}_{\tilde{k}}} \leq c$. Let $\alpha = \sqrt{\frac{\log(2TN/\delta)}{2}}$ and $\delta > 0$. With probability at least $1 - \frac{\delta}{T}$, we have that $\forall a \in [N]$
$$|\hat{f}_t(\tilde{x}_{a,t}) - f^*(\tilde{x}_{a,t})| \leq w_{a,t} := (\alpha + c\sqrt{\lambda})\, s_{a,t}, \qquad (4)$$
where $s_{a,t} = \lambda^{-1/2} \sqrt{\tilde{k}(\tilde{x}_{a,t}, \tilde{x}_{a,t}) - \tilde{k}_{a,t}^T (\eta_{t-1} \tilde{K}_{t-1} + \lambda I)^{-1} \eta_{t-1} \tilde{k}_{a,t}}$.
The result in Lemma 1 motivates the UCB
$$ucb_{a,t} = \hat{f}_t(\tilde{x}_{a,t}) + w_{a,t}$$
and inspires Algorithm 1.

Algorithm 1 KMTL-UCB
Input: $\beta \in \mathbb{R}^+$
for t = 1, ..., T do
    Update the (product) kernel matrix $\tilde{K}_{t-1}$ and $\eta_{t-1}$
    Observe context features at time $t$: $x_{a,t}$ for each $a \in [N]$
    Determine arm descriptor $z_a$ for each $a \in [N]$ to get augmented context $\tilde{x}_{a,t}$
    for all $a$ at time $t$ do
        $p_{a,t} \leftarrow \hat{f}_t(\tilde{x}_{a,t}) + \beta s_{a,t}$
    end for
    Choose arm $a_t = \arg\max_a p_{a,t}$, observe a real-valued payoff $r_{a_t,t}$ and update $y_t$
    Output: $a_t$
end for

Before an arm has been selected at least once, $\hat{f}_t(\tilde{x}_{a,t})$ and the second term in $s_{a,t}$, i.e., $\tilde{k}_{a,t}^T (\eta_{t-1} \tilde{K}_{t-1} + \lambda I)^{-1} \eta_{t-1} \tilde{k}_{a,t}$, are taken to be 0. In that case, the algorithm only uses the first term of $s_{a,t}$, i.e., $\sqrt{\tilde{k}(\tilde{x}_{a,t}, \tilde{x}_{a,t})}$, to form the UCB.

3.2 Choice of Task Similarity Space and Kernel

To illustrate the flexibility of our framework, we present the following three options for $\mathcal{Z}$ and $k_Z$:
1. Independent: $\mathcal{Z} = \{1, \ldots, N\}$, $k_Z(a, a') = \mathbb{1}_{a=a'}$. The augmented context for a context $x_a$ from arm $a$ is just $(a, x_a)$.
2. Pooled: $\mathcal{Z} = \{1\}$, $k_Z \equiv 1$. The augmented context for a context $x_a$ for arm $a$ is just $(1, x_a)$.
3. Multi-Task: $\mathcal{Z} = \{1, \ldots, N\}$ and $k_Z$ is a PSD matrix reflecting arm/task similarities. If this matrix is unknown, it can be estimated as discussed below.
Algorithm 1 with the first two choices specializes to the independent and pooled settings mentioned previously. 
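The three choices of $k_Z$ can be made concrete as $N \times N$ matrices. A small sketch, using a hypothetical 4-arm similarity matrix for the multi-task case, shows how the product kernel (1) interpolates between no sharing and full pooling:

```python
import numpy as np

N = 4  # number of arms

# 1. Independent: k_Z(a, a') = 1 iff a == a'  ->  identity matrix
K_Z_ind = np.eye(N)

# 2. Pooled: k_Z = 1 everywhere  ->  all-ones matrix
K_Z_pool = np.ones((N, N))

# 3. Multi-task: any PSD similarity matrix (hypothetical example where
#    arms 0,1 are similar to each other and arms 2,3 are similar).
K_Z_mtl = np.array([[1.0, 0.9, 0.1, 0.1],
                    [0.9, 1.0, 0.1, 0.1],
                    [0.1, 0.1, 1.0, 0.9],
                    [0.1, 0.1, 0.9, 1.0]])

# Product kernel on augmented contexts: k_tilde = k_Z(z, z') * k_X(x, x'), eqn (1).
def k_tilde(a, x, a2, x2, K_Z, sigma=1.0):
    k_X = np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))
    return K_Z[a, a2] * k_X

x = np.zeros(3)
# With identical contexts, the product kernel reduces to the arm similarity:
print(k_tilde(0, x, 1, x, K_Z_ind))   # 0.0 -> independent arms share nothing
print(k_tilde(0, x, 1, x, K_Z_pool))  # 1.0 -> all data pooled
print(k_tilde(0, x, 1, x, K_Z_mtl))   # 0.9 -> partial sharing
```

A cross-arm kernel value of 0 means data from one arm never influences another arm's estimate; intermediate values let similar arms borrow strength from each other.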
In either setting, choosing a linear kernel for $k_X$ leads to Lin-UCB, while a more general kernel essentially gives rise to Kernel-UCB. We will argue that the multi-task setting facilitates learning when there is high task similarity.
We also introduce a fourth option for $\mathcal{Z}$ and $k_Z$ that allows task similarity to be estimated when it is unknown. In particular, we are inspired by the kernel transfer learning framework of Blanchard et al. [3]. Thus, we define the arm similarity space to be $\mathcal{Z} = \mathcal{P}_X$, the set of all probability distributions on $\mathcal{X}$. We further assume that contexts for arm $a$ are drawn from probability measure $P_a$. Given a context $x_a$ for arm $a$, we define its augmented context to be $(P_a, x_a)$.
To define a kernel on $\mathcal{Z} = \mathcal{P}_X$, we use the same construction described in [3], originally introduced by Steinwart and Christmann [6]. In particular, in our experiments we use a Gaussian-like kernel
$$k_Z(P_a, P_{a'}) = \exp(-\|\Psi(P_a) - \Psi(P_{a'})\|^2 / 2\sigma_Z^2), \qquad (5)$$
where $\Psi(P) = \int k'_X(\cdot, x)\, dP(x)$ is the kernel mean embedding of a distribution $P$. This embedding is defined by yet another SPD kernel $k'_X$ on $\mathcal{X}$, which could be different from the $k_X$ used to define $\tilde{k}$. We may estimate $\Psi(P_a)$ via $\Psi(\hat{P}_a) = \frac{1}{n_{a,t-1}} \sum_{\tau \in t_a} k'_X(\cdot, x_{a_\tau,\tau})$, which leads to an estimate of $k_Z$.

4 Theoretical Analysis

To simplify the analysis we consider a modified version of the original problem (2):
$$\hat{f}_t = \arg\min_{f \in \mathcal{H}_{\tilde{k}}} \frac{1}{N} \sum_{a=1}^{N} \sum_{\tau \in t_a} (f(\tilde{x}_{a,\tau}) - r_{a,\tau})^2 + \lambda \|f\|^2_{\mathcal{H}_{\tilde{k}}}. \qquad (6)$$
In particular, this modified problem omits the $\frac{1}{n_{a,t-1}}$ terms, as they obscure the analysis. In practice, these terms should be incorporated. In this case $s_{a,t} = \lambda^{-1/2} \sqrt{\tilde{k}(\tilde{x}_{a,t}, \tilde{x}_{a,t}) - \tilde{k}_{a,t}^T (\tilde{K}_{t-1} + \lambda I)^{-1} \tilde{k}_{a,t}}$. Under this assumption Kernel-UCB is exactly KMTL-UCB with $k_Z \equiv 1$. On the other hand, KMTL-UCB can be viewed as a special case of Kernel-UCB on the augmented context space $\tilde{\mathcal{X}}$. Thus, the regret analysis of Kernel-UCB applies to KMTL-UCB, but it does not reveal the potential gains of multi-task learning. We present an interpretable regret bound that reveals the benefits of MTL. We also establish a lower bound on the UCB width that decreases as task similarity increases (presented in the supplementary file).

4.1 Analysis of SupKMTL-UCB

It is not trivial to analyze Algorithm 1 because the reward at time $t$ is dependent on the past rewards. We follow the same strategy originally proposed in [1] and used in [7, 18], which uses SupKMTL-UCB as a master algorithm, and BaseKMTL-UCB (which is called by SupKMTL-UCB) to get estimates of reward and width. SupKMTL-UCB builds mutually exclusive subsets of $[T]$ such that rewards in any subset are independent. This guarantees that the independence assumption of Lemma 1 is satisfied. We describe these algorithms in a supplementary section because of space constraints.
Theorem 1. Assume that $r_{a,t} \in [0, 1], \forall a \in [N]$, $T \geq 1$, $\|f^*\|_{\mathcal{H}_{\tilde{k}}} \leq c$, $\tilde{k}(\tilde{x}, \tilde{x}) \leq c_{\tilde{k}}, \forall \tilde{x} \in \tilde{\mathcal{X}}$, and the task similarity matrix $K_Z$ is known. 
With probability at least $1 - \delta$, SupKMTL-UCB satisfies
$$R(T) \leq 2\sqrt{T} + 10\left(\sqrt{\frac{\log(2TN(\log(T)+1)/\delta)}{2}} + c\sqrt{\lambda}\right)\sqrt{2m \log g([T])}\,\sqrt{T \lceil \log(T) \rceil} = O\left(\sqrt{T \log(g([T]))}\right)$$
where $g([T]) = \frac{\det(\tilde{K}_{T+1} + \lambda I)}{\lambda^{T+1}}$ and $m = \max(1, \frac{c_{\tilde{k}}}{\lambda})$.
Note that this theorem assumes that the task similarity is known. In the experiments for real datasets, using the approach discussed in subsection 3.2, we estimate the task similarity from the available data.

4.2 Interpretation of Regret Bound

The following theorems help us interpret the regret bound by looking at
$$g([T]) = \frac{\det(\tilde{K}_{T+1} + \lambda I)}{\lambda^{T+1}} = \prod_{t=1}^{T+1} \frac{\lambda_t + \lambda}{\lambda},$$
where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{T+1}$ are the eigenvalues of the kernel matrix $\tilde{K}_{T+1}$.
As mentioned above, the regret bound of Kernel-UCB applies to our method, and we are able to recover this bound as a corollary of Theorem 1. In the case of Kernel-UCB, $\tilde{K}_t = K_{X_t}, \forall t \in [T]$, as all arm estimators are assumed to be the same. We define the effective rank of $\tilde{K}_{T+1}$ in the same way as [18] defines the effective dimension of the kernel feature space.
Definition 1. The effective rank of $\tilde{K}_{T+1}$ is defined to be $r := \min\{j : j\lambda \log T \geq \sum_{i=j+1}^{T+1} \lambda_i\}$.
In the following result, the notation $\tilde{O}$ hides logarithmic terms.
Corollary 1. $\log(g([T])) \leq r \log\left(\frac{2T^2(T+1)c_{\tilde{k}} + r\lambda - r\lambda \log T}{r\lambda}\right)$, and therefore $R(T) = \tilde{O}(\sqrt{rT})$.
However, beyond recovering a known bound, Theorem 1 can also be interpreted to reveal the potential gains of multi-task learning. 
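The quantity $g([T])$ is easy to compute numerically. The following sketch, with a hypothetical Gaussian context kernel, hypothetical sizes, and the one-parameter similarity family $K_Z(\mu) = (1-\mu)I_N + \mu \mathbb{1}_N \mathbb{1}_N^T$ considered later in the paper, shows $\log g([T])$ shrinking as the common task similarity $\mu$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 5, 20                 # arms and pulls per arm (hypothetical sizes)
T = N * n
lam = 1.0

# A fixed Gaussian kernel matrix on contexts, arranged arm-by-arm.
X = rng.normal(size=(T, 3))
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K_X = np.exp(-d2 / 2)

def log_g(mu):
    """log det(K_tilde + lam I) - T log(lam) for K_Z(mu) = (1-mu) I + mu 11^T."""
    K_Z = (1 - mu) * np.eye(N) + mu * np.ones((N, N))
    K_tilde = np.kron(K_Z, np.ones((n, n))) * K_X   # (K_Z kron 11^T) Hadamard K_X
    sign, logdet = np.linalg.slogdet(K_tilde + lam * np.eye(T))
    return logdet - T * np.log(lam)

vals = [log_g(mu) for mu in (0.0, 0.5, 1.0)]
print(vals)  # decreasing: higher similarity -> smaller log g -> tighter regret bound
```

Here $\mu = 0$ recovers the independent setting (block-diagonal kernel) and $\mu = 1$ the pooled setting; the decrease of $\log g([T])$ with $\mu$ is exactly the behavior the regret bound rewards.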
To interpret the regret bound in Theorem 1, we make a further assumption that after time $t$, $n_{a,t} = \frac{t}{N}$ for all $a \in [N]$. For simplicity define $n_t = n_{a,t}$. Let $\odot$ denote the Hadamard product, $\otimes$ denote the Kronecker product and $\mathbb{1}_n \in \mathbb{R}^n$ be the vector of ones. Let $K_{X_t} = [k_X(x_{a_\tau,\tau}, x_{a_{\tau'},\tau'})]_{\tau,\tau'=1}^{t}$ be the $t \times t$ kernel matrix on contexts, $K_{Z_t} = [k_Z(z_{a_\tau}, z_{a_{\tau'}})]_{\tau,\tau'=1}^{t}$ be the associated $t \times t$ kernel matrix based on arm similarity, and $K_Z = [k_Z(z_a, z_{a'})]_{a,a'=1}^{N}$ be the $N \times N$ arm/task similarity matrix between $N$ arms, where $x_{a_\tau,\tau}$ is the observed context and $z_{a_\tau}$ is the associated arm descriptor. Using eqn. (1), we can write $\tilde{K}_t = K_{Z_t} \odot K_{X_t}$. We rearrange the sequence of $x_{a_\tau,\tau}$ so that the contexts for arm 1 come first, then those for arm 2, and so on, i.e., elements $(a-1)n_t$ to $a n_t$ belong to arm $a$. Define $\tilde{K}^r_t$, $K^r_{X_t}$ and $K^r_{Z_t}$ to be the rearranged kernel matrices based on this re-ordered set. Notice that we can write $\tilde{K}^r_t = (K_Z \otimes \mathbb{1}_{n_t} \mathbb{1}_{n_t}^T) \odot K^r_{X_t}$, and the eigenvalues $\lambda(\tilde{K}_t)$ and $\lambda(\tilde{K}^r_t)$ are equal. To summarize, we have
$$\tilde{K}_t = K_{Z_t} \odot K_{X_t}, \qquad \lambda(\tilde{K}_t) = \lambda\left((K_Z \otimes \mathbb{1}_{n_t} \mathbb{1}_{n_t}^T) \odot K^r_{X_t}\right). \qquad (7)$$
Theorem 2. Let the rank of matrix $K_{X_{T+1}}$ be $r_x$ and the rank of matrix $K_Z$ be $r_z$. Then $\log(g([T])) \leq r_z r_x \log\left(\frac{(T+1)c_{\tilde{k}} + \lambda}{\lambda}\right)$.
This means that when the rank of the task similarity matrix is low, which reflects a high degree of inter-task similarity, the regret bound is tighter. For comparison, note that when all tasks are independent, $r_z = N$, and when all tasks are the same (pooled), $r_z = 1$. In the case of Lin-UCB [7], where all arm estimators are assumed to be the same and $k_X$ is a linear kernel, the regret bound in Theorem 1 evaluates to $\tilde{O}(\sqrt{dT})$, where $d$ is the dimension of the context space. In the original Lin-UCB algorithm [13], where all arm estimators are different, the regret bound would be $\tilde{O}(\sqrt{NdT})$.
We can further comment on $g([T])$ when all distinct tasks (arms) are similar to each other with task similarity equal to $\mu$. Thus define $K_Z(\mu) := (1-\mu)I_N + \mu \mathbb{1}_N \mathbb{1}_N^T$ and $\tilde{K}^r_t(\mu) = (K_Z(\mu) \otimes \mathbb{1}_{n_t} \mathbb{1}_{n_t}^T) \odot K^r_{X_t}$.
Theorem 3. Let $g_\mu([T]) = \frac{\det(\tilde{K}^r_{T+1}(\mu) + \lambda I)}{\lambda^{T+1}}$. If $\mu_1 \leq \mu_2$ then $g_{\mu_1}([T]) \geq g_{\mu_2}([T])$.
This shows that when there is more task similarity, the regret bound is tighter.

4.3 Comparison with CGP-UCB

CGP-UCB transfers the learning from one task to another by leveraging additional known task-specific context variables [10], similar in spirit to KMTL-UCB. Indeed, with slight modifications, KMTL-UCB can be viewed as a frequentist analogue of CGP-UCB, and similarly CGP-UCB could be modified to address our setting. Furthermore, the term $g([T])$ appearing in our regret bound is equivalent to an information gain term used to analyze CGP-UCB. In the agnostic case of CGP-UCB, where there is no assumption of a Gaussian prior on decision functions, their regret bound is $O(\log(g([T]))\sqrt{T})$, while their regret bound matches ours when they adopt a GP prior on $f^*$. Thus, our primary contributions with respect to CGP-UCB are to provide a tighter regret bound in the agnostic case, and a technique for estimating task similarity, which is critical for real-world applications.

5 Experiments

We test our algorithm on synthetic data and some multi-class classification datasets. 
In the case of multi-class datasets, the number of arms $N$ is the number of classes and the reward is 1 if we predict the correct class, otherwise it is 0. We separate the data into two parts: a validation set and a test set. We use Gaussian kernels throughout, pre-select the kernel bandwidths using five-fold cross-validation on the holdout validation set, and use $\beta = 0.1$ for all experiments. We then run the algorithm on the test set 10 times (with different sequences of streaming data) and report the mean regret. For the synthetic data, we compare Kernel-UCB in the independent setting (Kernel-UCB-Ind) and pooled setting (Kernel-UCB-Pool), KMTL-UCB with known task similarity, and KMTL-UCB-Est, which estimates task similarity on the fly. For the real datasets in the multi-class classification setting, we compare Kernel-UCB-Ind and KMTL-UCB-Est. In this case, the pooled setting is not valid because $x_{a,t}$ is the same for all arms (only $z_a$ differs), and KMTL-UCB is not valid because the task similarity matrix is unknown. We also report the confidence intervals for these results in the supplementary material.

5.1 Synthetic News Article Data

Suppose an agent has access to a pool of articles and their context features. The agent then sees a user, along with his/her features, for which it needs to recommend an article. Based on the user features and article features, the algorithm gets a combined context $x_{a,t}$. The user context $x_{u,t} \in \mathbb{R}^2, \forall t$, is randomly drawn from an ellipse centered at $(0, 0)$ with major axis length 1 and minor axis length 0.5. Let $x_{u,t}[:, 1]$ be the minor axis and $x_{u,t}[:, 2]$ be the major axis. The article context $x_{art,t}$ is an angle $\theta \in [0, \frac{\pi}{2}]$. To get the overall summary $x_{a,t}$ of user and article, the user context $x_{u,t}$ is rotated by $x_{art,t}$. Rewards for each article are defined based on the minor axis: $r_{a,t} = \left(1.0 - (x_{u,t}[:, 1] - \frac{a}{N} + 0.5)^2\right)$.

Figure 1: Synthetic Data

Figure 1 shows one such example for 4 different arms. The color code describes the reward, the two axes show the information about the user context, and theta is the article context. We take $N = 5$. For KMTL-UCB, we use a Gaussian kernel on $x_{art,t}$ to get the task similarity.
The results of this experiment are shown in Figure 1. As one can see, Kernel-UCB-Pool performs the worst. This means that for this setting, combining all the data and learning a single estimator is not efficient. KMTL-UCB beats the other methods in all 10 runs, and Kernel-UCB-Ind and KMTL-UCB-Est perform equally well.

5.2 Multi-class Datasets

In the case of multi-class classification, each class is an arm and the features of an example for which the algorithm needs to recommend a class are the contexts. We consider the following datasets: Digits ($N = 10$, $d = 64$), Letter ($N = 26$, $d = 16$), MNIST ($N = 10$, $d = 780$), Pendigits ($N = 10$, $d = 16$), Segment ($N = 7$, $d = 19$) and USPS ($N = 10$, $d = 256$). Empirical mean regrets are shown in Figure 2. KMTL-UCB-Est performs the best in three of the datasets and performs equally well in the other three datasets. Figure 3 shows the estimated task similarity (re-ordered to reveal block structure), and one can see the effect of the estimated task similarity matrix on the empirical regret in Figure 2. For the Digits, Segment and MNIST datasets, there is significant inter-task similarity. 
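The task-similarity estimates used by KMTL-UCB-Est follow the mean-embedding construction of equation (5). A minimal sketch, assuming hypothetical per-arm context samples and a Gaussian choice for $k'_X$, where arms 0 and 1 draw contexts from nearby distributions and arm 2 from a shifted one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-arm context samples (stand-ins for the observed contexts t_a).
contexts = [rng.normal(0.0, 1.0, size=(30, 2)),   # arm 0
            rng.normal(0.2, 1.0, size=(30, 2)),   # arm 1: close to arm 0
            rng.normal(3.0, 1.0, size=(30, 2))]   # arm 2: far away

def gauss(A, B, sigma=1.0):
    d2 = ((A[:, None] - B[None, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def embedding_sq_dist(Xa, Xb):
    """||Psi(P_a) - Psi(P_b)||^2 in the RKHS of k'_X, via empirical mean embeddings."""
    return gauss(Xa, Xa).mean() + gauss(Xb, Xb).mean() - 2 * gauss(Xa, Xb).mean()

sigma_Z = 1.0
N = len(contexts)
K_Z = np.empty((N, N))
for a in range(N):
    for b in range(N):
        K_Z[a, b] = np.exp(-embedding_sq_dist(contexts[a], contexts[b])
                           / (2 * sigma_Z ** 2))

print(np.round(K_Z, 2))  # arms 0 and 1 come out similar; arm 2 stands apart
```

The squared RKHS distance expands into three kernel means, so the estimate needs only pairwise kernel evaluations between the stored contexts of each pair of arms; block structure like that in Figure 3 emerges when groups of arms share a context distribution.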
For Digits and Segment datasets, KMTL-UCB-Est is the best in all 10 runs of\nthe experiment while for MNIST, KMTL-UCB-Est is better for all but 1 run.\n\nFigure 2: Results on Multiclass Datasets - Empirical Mean Regret\n\nFigure 3: Estimated Task Similarity for Real Datasets\n\n6 Conclusions and future work\n\nWe present a multi-task learning framework in the contextual bandit setting and describe a way to\nestimate task similarity when it is not given. We give theoretical analysis, interpret the regret bound,\nand support the theoretical analysis with extensive experiments. In the supplementary material we\nestablish a lower bound on the UCB width, and argue that it decreases as task similarity increases.\nOur proposal to estimate the task similarity matrix using the arm similarity space Z = PX can be\nextended in different ways. For example, we could also incorporate previously observed rewards\ninto Z. This would alleviate a potential problem with our approach, namely, that some contexts\nmay have been selected when they did not yield a high reward. Additionally, by estimating the task\nsimilarity matrix, we are estimating arm-speci\ufb01c information. In the case of multiclass classi\ufb01cation,\nkZ re\ufb02ects information that represents the various classes. A natural extension is to incorporate\nmethods for representation learning into the MTL bandit setting.\n\n8\n\n\fReferences\n[1] P. Auer. Using con\ufb01dence bounds for exploitation-exploration trade-offs. Journal of Machine\n\nLearning Research, 3(Nov):397\u2013422, 2002.\n\n[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.\n\nMachine learning, 47(2-3):235\u2013256, 2002.\n\n[3] G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classi\ufb01cation tasks to a\nnew unlabeled sample. In Advances in neural information processing systems, pages 2178\u20132186,\n2011.\n\n[4] S. Bubeck and N. Cesa-Bianchi. 
Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[5] N. Cesa-Bianchi, C. Gentile, and G. Zappella. A gang of bandits. In Advances in Neural Information Processing Systems, pages 737–745, 2013.

[6] A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems, pages 406–414, 2010.

[7] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.

[8] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.

[9] S. Kale, L. Reyzin, and R. E. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054–1062, 2010.

[10] A. Krause and C. S. Ong. Contextual Gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

[11] V. Kuleshov and D. Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.

[12] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.

[13] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[14] S. Li, A. Karatzoglou, and C. Gentile. Collaborative filtering bandits. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 539–548. ACM, 2016.

[15] H. Robbins. Some aspects of the sequential design of experiments. 
In Herbert Robbins Selected\n\nPapers, pages 169\u2013177. Springer, 1985.\n\n[16] N. Srinivas, A. Krause, M. Seeger, and S. M. Kakade. Gaussian process optimization in the\nbandit setting: No regret and experimental design. In Proceedings of the 27th International\nConference on Machine Learning (ICML-10), pages 1015\u20131022, 2010.\n\n[17] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media,\n\n2008.\n\n[18] M. Valko, N. Korda, R. Munos, I. Flaounas, and N. Cristianini. Finite-time analysis of kernelised\n\ncontextual bandits. In Uncertainty in Arti\ufb01cial Intelligence, page 654. Citeseer, 2013.\n\n[19] S. S. Villar, J. Bowden, and J. Wason. Multi-armed bandit models for the optimal design of\nclinical trials: bene\ufb01ts and challenges. Statistical science: a review journal of the Institute of\nMathematical Statistics, 30(2):199, 2015.\n\n[20] J. White. Bandit algorithms for website optimization. \" O\u2019Reilly Media, Inc.\", 2012.\n\n9\n\n\f", "award": [], "sourceid": 2514, "authors": [{"given_name": "Aniket Anand", "family_name": "Deshmukh", "institution": "University of Michigan, Ann Arbor"}, {"given_name": "Urun", "family_name": "Dogan", "institution": "Microsoft"}, {"given_name": "Clay", "family_name": "Scott", "institution": "University of Michigan"}]}