{"title": "Discovering, Learning and Exploiting Relevance", "book": "Advances in Neural Information Processing Systems", "page_first": 1233, "page_last": 1241, "abstract": "In this paper we consider the problem of learning online what is the information to consider when making sequential decisions. We formalize this as a contextual multi-armed bandit problem where a high dimensional ($D$-dimensional) context vector arrives to a learner which needs to select an action to maximize its expected reward at each time step. Each dimension of the context vector is called a type. We assume that there exists an unknown relation between actions and types, called the relevance relation, such that the reward of an action only depends on the contexts of the relevant types. When the relation is a function, i.e., the reward of an action only depends on the context of a single type, and the expected reward of an action is Lipschitz continuous in the context of its relevant type, we propose an algorithm that achieves $\\tilde{O}(T^{\\gamma})$ regret with a high probability, where $\\gamma=2/(1+\\sqrt{2})$. Our algorithm achieves this by learning the unknown relevance relation, whereas prior contextual bandit algorithms that do not exploit the existence of a relevance relation will have $\\tilde{O}(T^{(D+1)/(D+2)})$ regret. Our algorithm alternates between exploring and exploiting, it does not require reward observations in exploitations, and it guarantees with a high probability that actions with suboptimality greater than $\\epsilon$ are never selected in exploitations. 
Our proposed method can be applied to a variety of learning applications including medical diagnosis, recommender systems, popularity prediction from social networks, network security etc., where at each instance of time vast amounts of different types of information are available to the decision maker, but the effect of an action depends only on a single type.", "full_text": "Discovering, Learning and Exploiting Relevance

Cem Tekin
Electrical Engineering Department
University of California Los Angeles
cmtkn@ucla.edu

Mihaela van der Schaar
Electrical Engineering Department
University of California Los Angeles
mihaela@ee.ucla.edu

Abstract

In this paper we consider the problem of learning online what is the information to consider when making sequential decisions. We formalize this as a contextual multi-armed bandit problem where a high dimensional (D-dimensional) context vector arrives to a learner which needs to select an action to maximize its expected reward at each time step. Each dimension of the context vector is called a type. We assume that there exists an unknown relation between actions and types, called the relevance relation, such that the reward of an action only depends on the contexts of the relevant types. When the relation is a function, i.e., the reward of an action only depends on the context of a single type, and the expected reward of an action is Lipschitz continuous in the context of its relevant type, we propose an algorithm that achieves Õ(T^γ) regret with a high probability, where γ = 2/(1 + √2). Our algorithm achieves this by learning the unknown relevance relation, whereas prior contextual bandit algorithms that do not exploit the existence of a relevance relation will have Õ(T^{(D+1)/(D+2)}) regret. Our algorithm alternates between exploring and exploiting, it does not require reward observations in exploitations, and it guarantees with a high probability that actions with suboptimality greater than ε are never selected in exploitations. Our proposed method can be applied to a variety of learning applications including medical diagnosis, recommender systems, popularity prediction from social networks, network security etc., where at each instance of time vast amounts of different types of information are available to the decision maker, but the effect of an action depends only on a single type.

1 Introduction

In numerous learning problems the decision maker is provided with vast amounts of different types of information which it can utilize to learn how to select actions that lead to high rewards. The value of each type of information can be regarded as the context on which the learner acts, hence all the information can be encoded in a context vector. We focus on problems where this context vector is high dimensional but the reward of an action only depends on a small subset of types. This dependence is given in terms of a relation between actions and types, which is called the relevance relation. For an action set A and a type set D, the relevance relation is given by R = {R(a)}_{a∈D is replaced by a∈A}, where R(a) ⊂ D. The expected reward of an action a only depends on the values of the relevant types of contexts. Hence, for a context vector x, action a's expected reward is equal to µ(a, x_{R(a)}), where x_{R(a)} is the context vector corresponding to the types in R(a). Several examples of relevance relations and their effect on expected action rewards are given in Fig. 1.
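The dependence of µ(a, x_{R(a)}) on a single coordinate can be made concrete with a small sketch. The relevance map and the reward curves below are our own illustrative choices, not from the paper; they only show that changing an irrelevant type leaves an action's expected reward unchanged.

```python
# Illustrative sketch (hypothetical names): each action's expected reward
# depends only on the context of its single relevant type, |R(a)| = 1.
relevance = {"a1": 0, "a2": 3}  # assumed relevance function R(a)

def expected_reward(action, context):
    """mu(a, x_{R(a)}): reads only the relevant coordinate of the context."""
    x = context[relevance[action]]
    # Assumed Lipschitz reward curves (Lipschitz constant L = 1):
    curves = {"a1": x, "a2": 1.0 - abs(x - 0.5)}
    return curves[action]

context = [0.2, 0.9, 0.4, 0.7, 0.1]
tweaked = [0.2, 0.0, 0.4, 0.7, 0.1]   # only an irrelevant type changed
# The expected reward of a1 is unaffected by irrelevant coordinates.
assert expected_reward("a1", context) == expected_reward("a1", tweaked)
```

A learner that identifies R(a) can therefore estimate rewards over one dimension instead of all D of them, which is the source of the dimension-independent regret rate.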
The problem of \ufb01nding\nthe relevance relation is important especially when maxa\u2208A |R(a)| << |D|.1\nIn this paper we\nconsider the case when the relevance relation is a function, i.e., |R(a)| = 1, for all a \u2208 A, which is\nan important special case. We discuss the extension of our framework to the more general case in\nSection 3.3.\n\n1For a set A, |A| denotes its cardinality.\n\n1\n\n\fFigure 1: Examples of relevance relations: (i) general relevance relation, (ii) linear relevance rela-\ntion, (iii) relevance function. In this paper we only consider (iii), while our methods can easily be\ngeneralized to (i) and (ii).\n\nRelevance relations exists naturally in many practical applications. For example, when sequentially\ntreating patients with a particular disease, many types of information (contexts) are usually available\n- the patients\u2019 age, weight, blood tests, scans, medical history etc. If a drug\u2019s effect on a patient is\ncaused by only one of the types, then learning the relevant type for the drug will result in signi\ufb01cantly\nfaster learning for the effectiveness of the drug for the patients.2 Another example is recommender\nsystems, where recommendations are made based on the high dimensional information obtained\nfrom the browsing and purchase histories of the users. A user\u2019s response to a product recommen-\ndation will depend on the user\u2019s gender, occupation, history of past purchases etc., while his/her\nresponse to other product recommendations may depend on completely different information about\nthe user such as the age and home address.\nTraditional contextual bandit solutions disregard existence of such relations, hence have regret\nbounds that scale exponentially with the dimension of the context vector [1, 2]. In order to solve the\ncurse of dimensionality problem, a new approach which learns the relevance relation in an online\nway is required. 
The algorithm we propose simultaneously learns the relevance relation (when it is a function) and the action rewards by comparing sample mean rewards of each action for context pairs of different types, calculated from the context and reward observations so far. The only assumption we make about actions and contexts is the Lipschitz continuity of the expected reward of an action in the context of its relevant type. Our main contributions can be summarized as follows:

• We propose the Online Relevance Learning with Controlled Feedback (ORL-CF) algorithm that alternates between exploration and exploitation phases, which achieves a regret bound of Õ(T^γ),3 with γ = 2/(1 + √2), when the relevance relation is a function.

• We derive separate bounds on the regret incurred in exploration and exploitation phases. ORL-CF only needs to observe the reward in exploration phases, hence the reward feedback is controlled. ORL-CF achieves the same time order of regret even when observing the reward has a non-zero cost.

• Given any δ > 0, which is an input to ORL-CF, suboptimal actions will never be selected in exploitation steps with probability at least 1 − δ. 
This is very important, perhaps vital in\nnumerous applications where the performance needs to be guaranteed, such as healthcare.\n\nDue to the limited space, numerical results on the performance of our proposed algorithm is included\nin the supplementary material.\n\n2Even when there are multiple relevant types for each action, but there is one dominant type whose effect\non the reward of the action is signi\ufb01cantly larger than the effects of other types, assuming that the relevance\nrelation is a function will be a good approximation.\n3O(\u00b7) is the Big O notation, \u02dcO(\u00b7) is the same as O(\u00b7) except it hides terms that have polylogarithmic growth.\n\n2\n\n\f2 Problem Formulation\nA is the set of actions, D is the dimension of the context vector, D := {1, 2, . . . , D} is the set of\ntypes, and R = {R(a)}a\u2208A : A \u2192 D is the relevance function, which maps every a \u2208 A to a\nunique d \u2208 D. At each time step t = 1, 2, . . ., a context vector xt arrives to the learner. After\nobserving xt the learner selects an action a \u2208 A, which results in a random reward rt(a, xt). The\nlearner may choose to observe this reward by paying cost cO \u2265 0. The goal of the learner is to\nmaximize the sum of the generated rewards minus costs of observations for any time horizon T .\nEach xt consists of D types of contexts, and can be written as xt = (x1,t, x2,t, . . . , xD,t) where xi,t\nis called the type i context. Xi denotes the space of type i contexts and X := X1 \u00d7 X2 \u00d7 . . . \u00d7 XD\ndenotes the space of context vectors. At any t, we have xi,t \u2208 Xi for all i \u2208 D. For the sake of\nnotational simplicity we take Xi = [0, 1] for all i \u2208 D, but all our results can be generalized to\nthe case when Xi is a bounded subset of the real line. For x = (x1, x2, . . . , xD) \u2208 X , rt(a, x)\nis generated according to an i.i.d. 
process with distribution F(a, x_{R(a)}) with support in [0, 1] and expected value µ(a, x_{R(a)}).
The following assumption gives a similarity structure between the expected reward of an action and the contexts of the type that is relevant to that action.
Assumption 1. For all a ∈ A and x, x′ ∈ X, we have |µ(a, x_{R(a)}) − µ(a, x′_{R(a)})| ≤ L|x_{R(a)} − x′_{R(a)}|, where L > 0 is the Lipschitz constant.

We assume that the learner knows the L given in Assumption 1. This is a natural assumption in contextual bandit problems [1, 2]. Given a context vector x = (x1, x2, . . . , xD), the optimal action is a∗(x) := arg max_{a∈A} µ(a, x_{R(a)}), but the learner does not know it since it does not know R, F(a, x_{R(a)}) and µ(a, x_{R(a)}) for a ∈ A, x ∈ X a priori. In order to assess the learner's loss due to unknowns, we compare its performance with the performance of an oracle benchmark which knows a∗(x) for all x ∈ X. Let µt(a) := µ(a, x_{R(a),t}). The action chosen by the learner at time t is denoted by αt. The learner also decides whether to observe the reward or not, and this decision of the learner at time t is denoted by βt ∈ {0, 1}, where βt = 1 implies that the learner chooses to observe the reward and βt = 0 implies that the learner does not observe the reward. The learner's performance loss with respect to the oracle benchmark is defined as the regret, whose value at time T is given by

R(T) := Σ_{t=1}^{T} µt(a∗(xt)) − Σ_{t=1}^{T} (µt(αt) − cO βt).    (1)

A regret that grows sublinearly in T, i.e., O(T^γ), γ < 1, guarantees convergence in terms of the average reward, i.e., R(T)/T → 0. 
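The benchmark in (1) can be written out directly. The sketch below is a minimal, hypothetical illustration of the regret definition: the oracle's cumulative expected reward minus the learner's, where each reward observation (β_t = 1) costs c_O. All numeric values are our own.

```python
# Minimal sketch of the regret R(T) in (1); all inputs are hypothetical.
def regret(mu_opt, mu_chosen, beta, c_obs):
    """R(T) = sum_t mu_t(a*(x_t)) - sum_t (mu_t(alpha_t) - c_O * beta_t)."""
    return sum(mu_opt) - sum(m - c_obs * b for m, b in zip(mu_chosen, beta))

# Two rounds: the learner pays the observation cost only in round 1 (beta = 1).
r = regret(mu_opt=[1.0, 0.9], mu_chosen=[0.8, 0.9], beta=[1, 0], c_obs=0.1)
assert abs(r - 0.3) < 1e-9
```

Note that observation costs enter the regret even when the chosen action is optimal, which is why controlling the feedback (observing rewards only in explorations) matters for the bound.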
We are interested in achieving sublinear growth with a rate independent of D.

3 Online Relevance Learning with Controlled Feedback

3.1 Description of the algorithm

In this section we propose the algorithm Online Relevance Learning with Controlled Feedback (ORL-CF), which learns the best action for each context vector by simultaneously learning the relevance relation, and then estimating the expected reward of each action. The feedback, i.e., reward observations, is controlled based on the past context vector arrivals, in a way that reward observations are only made for actions for which the uncertainty in the reward estimates is high for the current context vector. The controlled feedback feature allows ORL-CF to operate as an active learning algorithm. The operation of ORL-CF can be summarized as follows:

• Adaptively discretize (partition) the context space of each type to learn action rewards of similar contexts together.
• For an action, form reward estimates for pairs of intervals corresponding to pairs of types. Based on the accuracy of these estimates, either choose to explore and observe the reward or choose to exploit the best estimated action for the current context vector.
• In order to choose the best action, compare the reward estimates for pairs of intervals for which one interval belongs to type i, for each type i and action a. 
Conclude that type i is relevant to a if the variation of the reward estimates does not greatly exceed the natural variation of the expected reward of action a over the interval of type i (calculated using Assumption 1).

Online Relevance Learning with Controlled Feedback (ORL-CF):
1: Input: L, ρ, δ.
2: Initialization: Pi,1 = {[0, 1]}, i ∈ D. Run Initialize(i, Pi,1, 1), i ∈ D.
3: while t ≥ 1 do
4:   Observe xt, find pt that xt belongs to.
5:   Set Ut := ∪_{i∈D} Ui,t, where Ui,t (given in (3)) is the set of under-explored actions for type i.
6:   if Ut ≠ ∅ then
7:     (Explore) βt = 1, select αt randomly from Ut, observe rt(αt, xt).
8:     Update pairwise sample means: for all q ∈ Qt, given in (2), r̄^{ind(q)}(q, αt) = (S^{ind(q)}(q, αt) r̄^{ind(q)}(q, αt) + rt(αt, xt)) / (S^{ind(q)}(q, αt) + 1).
9:     Update counters: for all q ∈ Qt, S^{ind(q)}(q, αt)++.
10:  else
11:    (Exploit) βt = 0, for each a ∈ A calculate the set of candidate relevant types Relt(a) given in (4).
12:    for a ∈ A do
13:      if Relt(a) = ∅ then
14:        Randomly select ĉt(a) from D.
15:      else
16:        For each i ∈ Relt(a), calculate Vart(i, a) given in (5).
17:        Set ĉt(a) = arg min_{i∈Relt(a)} Vart(i, a).
18:      end if
19:      Calculate r̄^{ĉt(a)}_t(a) as given in (6).
20:    end for
21:    Select αt = arg max_{a∈A} r̄^{ĉt(a)}_t(p_{ĉt(a),t}, a).
22:  end if
23:  for i ∈ D do
24:    N^i(pi,t)++.
25:    if N^i(pi,t) ≥ 2^{ρ l(pi,t)} then
26:      Create two new level l(pi,t) + 1 intervals p, p′ whose union gives pi,t.
27:      Pi,t+1 = Pi,t ∪ {p, p′} − {pi,t}.
28:      Run Initialize(i, {p, p′}, t).
29:    else
30:      Pi,t+1 = Pi,t.
31:    end if
32:  end for
33:  t = t + 1
34: end while

Initialize(i, B, t):
1: for p ∈ B do
2:   Set N^i(p) = 0, r̄^{i,j}(p, pj, a) = r̄^{j,i}(pj, p, a) = 0, S^{i,j}(p, pj, a) = S^{j,i}(pj, p, a) = 0 for all a ∈ A, j ∈ D−i and pj ∈ Pj,t.
3: end for

Figure 2: Pseudocode for ORL-CF.

Since the number of contexts is infinite, learning the reward of an action for each context is not feasible. In order to learn fast, ORL-CF exploits the similarities between the contexts of the relevant type given in Assumption 1 to estimate the rewards of the actions. The key to the success of our algorithm is that this estimation is good enough. ORL-CF adaptively forms the partition of the space for each type in D, where the partition for the context space of type i at time t is denoted by Pi,t. All the elements of Pi,t are disjoint intervals of Xi = [0, 1] whose lengths are elements of the set {1, 2^{−1}, 2^{−2}, . . .}.4 An interval with length 2^{−l}, l ≥ 0, is called a level l interval, and for an interval p, l(p) denotes its level and s(p) denotes its length. By convention, intervals are of the form (a, b], with the only exception being the interval containing 0, which is of the form [0, b].5 Let pi,t ∈ Pi,t be the interval that xi,t belongs to, pt := (p1,t, . . . , pD,t) and P t := (P1,t, . . .
,PD,t).\n\n4Setting interval lengths to powers of 2 is for presentational simplicity. In general, interval lengths can be\n\nset to powers of any real number greater than 1.\n\n5Endpoints of intervals will not matter in our analysis, so our results will hold even when the intervals have\n\ncommon endpoints.\n\n4\n\n\fThe pseudocode of ORL-CF is given in Fig. 2. ORL-CF starts with Pi,1 = {Xi} = {[0, 1]} for\neach i \u2208 D. As time goes on and more contexts arrive for each type i, it divides Xi into smaller and\nsmaller intervals. The idea is to combine the past observations made in an interval to form sample\nmean reward estimates for each interval, and use it to approximate the expected rewards of actions\nfor contexts lying in these intervals. The intervals are created in a way to balance the variation of\nthe sample mean rewards due to the number of past observations that are used to calculate them and\nthe variation of the expected rewards in each interval.\nWe also call Pi,t the set of active intervals for type i at time t. Since the partition of each type is\nadaptive, as time goes on, new intervals become active while old intervals are deactivated, based on\nt (p) be the number of times xi,t(cid:48) \u2208 p \u2208 Pi,t(cid:48) for\nhow contexts arrive. For a type i interval p, let N i\nt(cid:48) \u2264 t. The duration of time that an interval remains active, i.e., its lifetime, is determined by an input\nparameter \u03c1 > 0, which is called the duration parameter. Whenever the number of arrivals to an\ninterval p exceeds 2\u03c1l(p), ORL-CF deactivates p and creates two level l(p)+1 intervals, whose union\nt (pi,t) \u2265 2\u03c1l,\ngives p. For example, when pi,t = (k2\u2212l, (k + 1)2\u2212l] for some 0 < k \u2264 2l \u2212 1 if N i\nORL-CF sets\n\nPi,t+1 = Pi,t \u222a {(k2\u2212l, (k + 1/2)2\u2212l], ((k + 1/2)2\u2212l, (k + 1)2\u2212l]} \u2212 {pi,t}.\n\nOtherwise Pi,t+1 remains the same as Pi,t. 
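The splitting rule above can be sketched in a few lines. This is a minimal illustration under assumed data structures (the class and variable names are ours, not the paper's): each type's context space [0, 1] starts as one level-0 interval, and an interval at level l is replaced by its two halves once it has received 2^{ρl} context arrivals.

```python
# Sketch of the adaptive partitioning of X_i = [0, 1] (names assumed).
class AdaptivePartition:
    def __init__(self, rho):
        self.rho = rho
        self.intervals = [(0.0, 1.0, 0)]   # active intervals: (left, right, level)
        self.counts = {(0.0, 1.0, 0): 0}   # arrivals per active interval

    def locate(self, x):
        """Return the active interval (a, b] (or [0, b]) containing x."""
        for (a, b, l) in self.intervals:
            if a < x <= b or (a == 0.0 and x == 0.0):
                return (a, b, l)
        raise ValueError("x outside [0, 1]")

    def observe(self, x):
        p = self.locate(x)
        self.counts[p] += 1
        a, b, l = p
        if self.counts[p] >= 2 ** (self.rho * l):   # lifetime exceeded: split
            mid = (a + b) / 2.0
            self.intervals.remove(p)
            del self.counts[p]
            for child in ((a, mid, l + 1), (mid, b, l + 1)):
                self.intervals.append(child)
                self.counts[child] = 0
        return p

part = AdaptivePartition(rho=1)
part.observe(0.3)                      # level 0 splits after 2^0 = 1 arrival
part.observe(0.7); part.observe(0.6)   # (0.5, 1.0] splits after 2^1 = 2
assert (0.5, 0.75, 2) in part.intervals
```

Regions of the context space that receive many arrivals are thus refined quickly, while rarely visited regions keep coarse intervals, matching the balance between estimation noise and the Lipschitz variation within an interval.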
It is easy to see that the lifetime of an interval increases\nexponentially in its duration parameter.\nWe next describe the counters, control numbers and sample mean rewards the learner keeps for each\npair of intervals corresponding to a pair of types to determine whether to explore or exploit and how\nto exploit. Let D\u2212i := D \u2212 {i}. For type i, let Qi,t := {(pi,t, pj,t) : j \u2208 D\u2212i} be the pair of\nintervals that are related to type i at time t, and let\n\n(2)\nTo denote an element of Qi,t or Qt we use index q. For any q \u2208 Qt, the corresponding pair of types\nis denoted by ind(q). For example, ind((pi,t, pj,t)) = i, j. The decision to explore or exploit at time\nt is solely based on pt. For events A1, . . . , AK, let I(A1, . . . , Ak) denote the indicator function of\n\nQt :=\n\nQi,t.\n\ni\u2208D\n\nk=1:K Ak. For p \u2208 Pi,t, p(cid:48) \u2208 Pj,t, let\n\nevent(cid:84)\n\n(cid:91)\n\nt (p, p(cid:48), a) :=\nSi,j\n\nI (\u03b1t(cid:48) = a, \u03b2t = 1, pi,t(cid:48) = p, pj,t(cid:48) = p(cid:48)) ,\n\nt\u22121(cid:88)\n\nt(cid:48)=1\n\nbe the number of times a is selected and the reward is observed when the type i context is in p and\ntype j context is in p(cid:48), summed over times when both intervals are active. Also for the same p and\np(cid:48) let\n\nt (p, p(cid:48), a) :=\n\u00afri,j\n\nrt(a, xt)I (\u03b1t(cid:48) = a, \u03b2t = 1, pi,t(cid:48) = p, pj,t(cid:48) = p(cid:48))\n\n/(Si,j\n\nt (p, p(cid:48), a)),\n\n(cid:33)\n\n(cid:32) t\u22121(cid:88)\n\nt(cid:48)=1\n\nbe the pairwise sample mean reward of action a for pair of intervals (p, p(cid:48)).\nAt time t, ORL-CF assigns a control number to each i \u2208 D denoted by\n\nDi,t :=\n\n2 log(tD|A|/\u03b4)\n\n(Ls(pi,t))2\n\n,\n\nUi,t := {a \u2208 A : Sind(q)\n\nwhich depends on the cardinality of A, the length of the active interval that type i context is in\nat time t and a con\ufb01dence parameter \u03b4 > 0, which controls the accuracy of sample mean reward\nestimates. 
Then, it computes the set of under-explored actions for type i as\n\n(q, a) < Di,t for some q \u2208 Qi(t)},\n\nand then, the set of under-explored actions as Ut :=(cid:83)\n\n(3)\ni\u2208D Ui,t. The decision to explore or exploit is\nbased on whether or not Ut is empty.\n(i) If Ut (cid:54)= \u2205, ORL-CF randomly selects an action \u03b1t \u2208 Ut to explore, and observes its reward\nrt(\u03b1t, xt). Then, it updates the pairwise sample mean rewards and pairwise counters for all q \u2208 Qt,\n\u00afrind(q)\nt+1 (q, \u03b1t) =\n\n, Sind(q)\n\nt+1 (q, \u03b1t) = Sind(q)\n\nt+1 (q,\u03b1t)+rt(\u03b1t,xt)\n\nSind(q)\nt\n\n(q,\u03b1t)\u00afrind(q)\n\n(q, \u03b1t) + 1.\n\nt\n\nt\n\nSind(q)\nt\n\n(q,\u03b1t)+1\n\n5\n\n\f(ii) If Ut = \u2205, ORL-CF exploits by estimating the relevant type \u02c6ct(a) for each a \u2208 A and forming\nsample mean reward estimates for action a based on \u02c6ct(a). It \ufb01rst computes the set of candidate\nrelevant types for each a \u2208 A,\nRelt(a) := {i \u2208 D : |\u00afri,j\n\n(4)\nThe intuition is that if i is the type that is relevant to a, then independent of the values of the contexts\nof the other types, the variation of the pairwise sample mean reward of a over pi,t must be very close\nto the variation of the expected reward of a in that interval.\nIf Relt(a) is empty, this implies that ORL-CF failed to identify the relevant type, hence \u02c6ct(a) is\nrandomly selected from D. If Relt(a) is nonempty, ORL-CF computes the maximum variation\n\nt (pi,t, pk,t, a)| \u2264 3Ls(pi,t),\u2200j, k \u2208 D\u2212i}.\n\nt (pi,t, pj,t, a) \u2212 \u00afri,k\n\n|\u00afri,j\nt (pi,t, pj,t, a) \u2212 \u00afri,k\n\nt (pi,t, pk,t, a)|,\n\nVart(i, a) := max\nj,k\u2208D\u2212i\n\n(5)\nfor each i \u2208 Relt(a). Then it sets \u02c6ct(a) = mini\u2208Relt(a) Vart(i, a). 
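The two decisions just described, when to explore and which type to trust in exploitation, can be sketched together. The data structures below are assumptions for illustration: `pair_counts[a][q]` stands for the pairwise counter S^{ind(q)}_t(q, a), and `rbar[(i, j)]` for the pairwise sample mean over the current intervals of types i and j; all names and numbers are ours.

```python
import math

def control_number(t, D, n_actions, delta, L, s):
    """D_{i,t} = 2 log(t D |A| / delta) / (L s(p_{i,t}))^2."""
    return 2.0 * math.log(t * D * n_actions / delta) / (L * s) ** 2

def under_explored(pair_counts, threshold):
    """U_{i,t} of (3): actions with some pairwise count below the threshold."""
    return {a for a, per_q in pair_counts.items()
            if any(c < threshold for c in per_q.values())}

def variation(rbar, i, types):
    """Var_t(i, a) of (5): max spread of pairwise means anchored at type i."""
    others = [j for j in types if j != i]
    return max(abs(rbar[(i, j)] - rbar[(i, k)]) for j in others for k in others)

def estimated_relevant(rbar, types, L, s):
    """Rel_t(a) of (4), then hat{c}_t(a): candidate type of least variation."""
    rel = {i for i in types if variation(rbar, i, types) <= 3.0 * L * s[i]}
    if not rel:
        return None        # relevance identification failed; pick randomly
    return min(rel, key=lambda i: variation(rbar, i, types))

# Exploration trigger: action "a1" is still under-explored at this threshold.
thr = control_number(t=100, D=3, n_actions=2, delta=0.1, L=1.0, s=0.25)
counts = {"a1": {"q1": 3, "q2": 400}, "a2": {"q1": 400, "q2": 400}}
assert under_explored(counts, thr) == {"a1"}

# Exploitation: type 1 varies too much across partner types, so type 0 wins.
types, s_len = [0, 1, 2], {0: 0.1, 1: 0.1, 2: 0.1}
rbar = {(0, 1): 0.50, (0, 2): 0.52,
        (1, 0): 0.20, (1, 2): 0.80,
        (2, 0): 0.40, (2, 1): 0.45}
assert estimated_relevant(rbar, types, L=1.0, s=s_len) == 0
```

The intuition is visible in the example: when type i is truly relevant, the pairwise means anchored at i cannot spread by more than the Lipschitz variation over p_{i,t} plus estimation noise, whereas an irrelevant anchor type inherits the full variation of the relevant coordinates of its partners.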
This way, whenever the type\nrelevant to action a is in Relt(a), even if it is not selected as the estimated relevant type, the sample\nmean reward of a calculated based on the estimated relevant type will be very close to the sample\nmean of its reward calculated according to the true relevant type. After \ufb01nding the estimated relevant\ntypes, the sample mean reward of each action is computed based on its estimated relevant type as\n\nj\u2208D\u2212\u02c6ct(a)\n\n\u00afr\u02c6ct(a),j\nt\n\n(p\u02c6ct(a),t, pj,t, a)S \u02c6ct(a),j\n\nt\n\n(p\u02c6ct(a),t, pj,t, a)\n\n.\n\n(6)\n\nj\u2208D\u2212\u02c6ct(a)\nThen, ORL-CF selects \u03b1t = arg maxa\u2208A \u00afr\u02c6ct(a)\nexploitations, pairwise sample mean rewards and counters are not updated.\n\n(p\u02c6ct(a),t, pj,t, a)\n\nt\n\nS \u02c6ct(a),j\nt\n\n(p\u02c6ct(a),t, a). Since the reward is not observed in\n\n(cid:80)\n\n\u00afr\u02c6ct(a)\nt\n\n(a) :=\n\n(cid:80)\n\n3.2 Regret analysis of ORL-CF\nLet \u03c4 (T ) \u2282 {1, 2, . . . , T} be the set of time steps in which ORL-CF exploits by time T . \u03c4 (T ) is a\nrandom set which depends on context arrivals and the randomness of the action selection of ORL-\nCF. The regret R(T ) de\ufb01ned in (1) can be written as a sum of the regret incurred during explorations\n(denoted by RO(T )) and the regret incurred during exploitations (denoted by RI(T )). The following\ntheorem gives a bound on the regret of ORL-CF in exploitation steps.\nTheorem 1. Let ORL-CF run with duration parameter \u03c1 > 0, con\ufb01dence parameter \u03b4 > 0 and\ncontrol numbers Di,t := 2 log(t|A|D/\u03b4)\n, for i \u2208 D. Let Rinst(t) be the instantaneous regret at time t,\nwhich is the loss in expected reward at time t due to not selecting a\u2217(xt). 
Then, with probability at least 1 − δ, we have

Rinst(t) ≤ 8L (s(p_{R(αt),t}) + s(p_{R(a∗(xt)),t})),

for all t ∈ τ(T), and the total regret in exploitation steps is bounded above by

RI(T) ≤ 8L Σ_{t∈τ(T)} (s(p_{R(αt),t}) + s(p_{R(a∗(xt)),t})) ≤ 16L · 2^{2ρ} T^{ρ/(1+ρ)},

for arbitrary context vectors x1, x2, . . . , xT.

Theorem 1 provides both context arrival process dependent and worst case bounds on the exploitation regret of ORL-CF. By choosing ρ arbitrarily close to zero, RI(T) can be made O(T^γ) for any γ > 0. While this is true, the reduction in regret for smaller ρ comes not only from increased accuracy, but also from the reduction in the number of time steps in which ORL-CF exploits, i.e., |τ(T)|. By definition, time t is an exploitation step if

S^{i,j}_t(pi,t, pj,t, a) ≥ 2 log(t|A|D/δ) / (L^2 min{s(pi,t)^2, s(pj,t)^2}) = 2^{2 max{l(pi,t), l(pj,t)} + 1} log(t|A|D/δ) / L^2,

for all q = (pi,t, pj,t) ∈ Qt, i, j ∈ D. This implies that for any q ∈ Qi,t which has the interval with maximum level equal to l, Õ(2^{2l}) explorations are required before any exploitation can take place. Since the time a level l interval can stay active is 2^{ρl}, it is required that ρ ≥ 2 so that τ(T) is nonempty.
The next theorem gives a bound on the regret of ORL-CF in exploration steps.

Theorem 2. Let ORL-CF run with ρ, δ and Di,t, i ∈ D values as stated in Theorem 1. Then,

RO(T) ≤ (960 D^2 (cO + 1) log(T|A|D/δ) / (7L^2)) T^{4/ρ} + (64 D^2 (cO + 1) / 3) T^{2/ρ},

with probability 1, for arbitrary context vectors x1, x2, . . . , xT.

Based on the choice of the duration parameter ρ, which determines how long an interval will stay active, it is possible to get different regret bounds for explorations and exploitations. 
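The trade-off between the two bounds is easy to check numerically: the dominant exploration exponent 4/ρ from Theorem 2 decreases in ρ while the exploitation exponent ρ/(1+ρ) from Theorem 1 increases, and the two coincide exactly at ρ = 2 + 2√2, the balancing value used in Theorem 3.

```python
import math

# Quick numerical check that the leading exponents of the exploration and
# exploitation regret bounds coincide at rho = 2 + 2*sqrt(2).
rho = 2.0 + 2.0 * math.sqrt(2.0)
exploit_exp = rho / (1.0 + rho)   # T^{rho/(1+rho)} from Theorem 1
explore_exp = 4.0 / rho           # dominant T^{4/rho} term from Theorem 2
assert abs(exploit_exp - explore_exp) < 1e-12
# Both equal the rate 2/(1 + sqrt(2)) ~ 0.828 stated for ORL-CF.
assert abs(exploit_exp - 2.0 / (1.0 + math.sqrt(2.0))) < 1e-12
```

Algebraically, ρ/(1+ρ) = 4/ρ reduces to ρ² − 4ρ − 4 = 0, whose positive root is 2 + 2√2.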
Any ρ > 4 will give a sublinear regret bound for both explorations and exploitations. The regret in exploitations increases in ρ while the regret in explorations decreases in ρ.
Theorem 3. Let ORL-CF run with δ and Di,t, i ∈ D values as stated in Theorem 1 and ρ = 2 + 2√2. Then, the time orders of the exploration and exploitation regrets are balanced up to logarithmic factors. With probability at least 1 − δ we have both RI(T) = Õ(T^{2/(1+√2)}) and RO(T) = Õ(T^{2/(1+√2)}).
Remark 1. Prior work on contextual bandits focused on balancing the regret due to exploration and exploitation. For example, in [1, 2], for a D-dimensional context vector, algorithms are shown to achieve Õ(T^{(D+1)/(D+2)}) regret.6 Also, in [1] an O(T^{(D+1)/(D+2)}) lower bound on the regret is proved. An interesting question is to find the tightest lower bound for contextual bandits with a relevance function. One trivial lower bound is O(T^{2/3}), which corresponds to D = 1. However, since finding the action with the highest expected reward for a context vector requires comparisons of estimated rewards of actions with different relevant types, which in turn requires accurate sample mean reward estimates for 2 dimensions of the context space corresponding to those types, we conjecture that a tighter lower bound is O(T^{3/4}). Proving this is left as future work.

Another interesting case is when actions with suboptimality greater than ε > 0 must never be chosen in any exploitation step by time T. When such a condition is imposed, ORL-CF can start with partitions Pi,1 that have sets with high levels, so that it explores more at the beginning to have more accurate reward estimates before any exploitation. The following theorem gives the regret bound of ORL-CF for this case.
Theorem 4. 
Let ORL-CF run with duration parameter ρ > 0, confidence parameter δ > 0, control numbers Di,t := 2 log(t|A|D/δ)/(L s(pi,t))^2, and with initial partitions Pi,1, i ∈ D, consisting of intervals of level lmin = ⌈log2(3L/(2ε))⌉ (i.e., of length 2^{−lmin}). Then, with probability 1 − δ, Rinst(t) ≤ ε for all t ∈ τ(T),

RI(T) ≤ 16L · 2^{2ρ} T^{ρ/(1+ρ)}, and

RO(T) ≤ (81L^4/ε^4) ( (960 D^2 (cO + 1) log(T|A|D/δ) / (7L^2)) T^{4/ρ} + (64 D^2 (cO + 1) / 3) T^{2/ρ} ),

for arbitrary context vectors x1, x2, . . . , xT. Bounds on RI(T) and RO(T) are balanced for ρ = 2 + 2√2.

3.3 Future Work

In this paper we only considered relevance relations that are functions. Similar learning methods can be developed for more general relevance relations, such as the ones given in Fig. 1 (i) and (ii). For example, for the general case in Fig. 1 (i), if |R(a)| ≤ Drel ≪ D for all a ∈ A, and Drel is known by the learner, the following variant of ORL-CF can be used to achieve regret whose time order depends only on Drel but not on D.

• Instead of keeping pairwise sample mean reward estimates, keep sample mean reward estimates of actions for (Drel + 1)-tuples of intervals of Drel + 1 types.

• For a Drel-tuple of types i, let Qi,t be the (Drel + 1)-tuples of intervals that are related to i at time t, and Qt be the union of Qi,t over all Drel-tuples of types. 
Similar to ORL-CF,\ncompute the set of under-explored actions Ui,t, and the set of candidate relevant Drel tuples\nof types Relt(a), using the newly de\ufb01ned sample mean reward estimates.\n\n6The results are shown in terms of the covering dimension which reduces to Euclidian dimension for our\n\nproblem.\n\n7\n\n\f\u2022 In exploitation, set \u02c6ct(a) to be the Drel tuple of types with the minimax variation, where the\nvariation of action a for a tuple i is de\ufb01ned similar to (5), as the maximum of the distance\nbetween the sample mean rewards of action a for Drel+1 tuples that are in Qi,t.\n\nAnother interesting case is when the relevance relation is linear as given in Fig. 1 (ii). For example,\nfor action a if there is a type i that is much more relevant compared to other types j \u2208 D\u2212i, i.e.,\nwa,i >> wa,j, where the weights wa,i are given in Fig. 1, then ORL-CF is expected to have good\nperformance (but not sublinear regret with respect to the benchmark that knows R).\n\n4 Related Work\n\nContextual bandit problems are studied by many others in the past [3, 4, 1, 2, 5, 6]. The prob-\nlem we consider in this paper is a special case of the Lipschitz contextual bandit problem [1, 2],\nwhere the only assumption is the existence of a known similarity metric between the expected re-\nwards of actions for different contexts. It is known that the lower bound on regret for this problem\nis O(T (D+1)/(D+2)) [1], and there exists algorithms that achieve \u02dcO(T (D+1)/(D+2)) regret [1, 2].\nCompared to the prior work above, ORL-CF only needs to observe rewards in explorations and has\na regret whose time order is independent of D. Hence it can still learn the optimal actions fast\nenough in settings where observations are costly and context vector is high dimensional.\nExamples of related works that consider limited observations are KWIK learning [7, 8] and label\nef\ufb01cient learning [9, 10, 11]. 
For example, [8] considers a bandit model where the reward function\ncomes from a parameterized family of functions and gives bound on the average regret. An online\nprediction problem is considered in [9, 10, 11], where the predictor (action) lies in a class of linear\npredictors. The benchmark of the context is the best linear predictor. This restriction plays a crucial\nrole in deriving regret bounds whose time order does not depend on D. Similar to these works,\nORL-CF can guarantee with a high probability that actions with large suboptimalities will never\nbe selected in exploitation steps. However, we do not have any assumptions on the form of the\nexpected reward function other than the Lipschitz continuity and that it depends on a single type for\neach action.\nIn [12] graphical bandits are proposed where the learner takes an action vector a which includes\nactions from several types that consitute a type set T . The expected reward of a for context vector x\ncan be decomposed into sum of reward functions each of which only depends on a subset of D \u222a T .\nHowever, it is assumed that the form of decomposition is known but the functions are not known.\nAnother work [13] proposes a fast learning algorithm for an i.i.d. contextual bandit problem in\nwhich the rewards for contexts and actions are sampled from a joint probability distribution. In this\nwork the authors consider learning the best policy from a \ufb01nite set of policies with oracle access,\nand prove a regret bound of O(\nT ) which is also logarithmic in the size of the policy space. In\ncontrast, in our problem (i) contexts arrive according to an arbitrary exogenous process, and the\naction rewards are sampled from an i.i.d. distribution given the context value, (ii) the set of policies\nthat the learner can adopt is not restricted.\nLarge dimensional action spaces, where the rewards depend on a subset of the types of actions are\nconsidered in [14] and [15]. 
[14] considers the problem when the reward is Hölder continuous in an unknown low-dimensional tuple of types, and uses a special discretization of the action space to achieve dimension-independent bounds on the regret. This discretization can be used effectively because the learner selects the actions, as opposed to our case, where the learner has no control over the contexts. [15] considers the problem of optimizing high-dimensional functions that have an unknown low-dimensional structure from noisy observations.

5 Conclusion

In this paper we formalized the problem of learning the best action through learning the relevance relation between types of contexts and actions. For the case when the relevance relation is a function, we proposed an algorithm that (i) has sublinear regret with time order independent of D, (ii) only requires reward observations in explorations, and (iii) for any ε > 0, does not select any ε-suboptimal actions in exploitations with a high probability. In the future we will extend our results to the linear and general relevance relations illustrated in Fig. 1.

References

[1] T. Lu, D. Pál, and M. Pál, "Contextual multi-armed bandits," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 485–492.

[2] A. Slivkins, "Contextual bandits with similarity information," in Conference on Learning Theory (COLT), 2011.

[3] E. Hazan and N. Megiddo, "Online learning with prior knowledge," in Learning Theory. Springer, 2007, pp. 499–513.

[4] J. Langford and T. Zhang, "The epoch-greedy algorithm for contextual multi-armed bandits," in Advances in Neural Information Processing Systems (NIPS), vol. 20, 2007, pp. 1096–1103.

[5] W. Chu, L. Li, L. Reyzin, and R. E.
Schapire, "Contextual bandits with linear payoff functions," in International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, pp. 208–214.

[6] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, "Efficient optimal learning for contextual bandits," arXiv preprint arXiv:1106.2369, 2011.

[7] L. Li, M. L. Littman, T. J. Walsh, and A. L. Strehl, "Knows what it knows: a framework for self-aware learning," Machine Learning, vol. 82, no. 3, pp. 399–443, 2011.

[8] K. Amin, M. Kearns, M. Draief, and J. D. Abernethy, "Large-scale bandit problems and KWIK learning," in International Conference on Machine Learning (ICML), 2013, pp. 588–596.

[9] N. Cesa-Bianchi, C. Gentile, and F. Orabona, "Robust bounds for classification via selective sampling," in International Conference on Machine Learning (ICML), 2009, pp. 121–128.

[10] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, "Efficient bandit algorithms for online multiclass prediction," in International Conference on Machine Learning (ICML), 2008, pp. 440–447.

[11] E. Hazan and S. Kale, "Newtron: an efficient bandit algorithm for online multiclass prediction," in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 891–899.

[12] K. Amin, M. Kearns, and U. Syed, "Graphical models for bandit problems," in Conference on Uncertainty in Artificial Intelligence (UAI), 2011.

[13] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire, "Taming the monster: A fast and simple algorithm for contextual bandits," arXiv preprint arXiv:1402.0555, 2014.

[14] H. Tyagi and B. Gärtner, "Continuum armed bandit problem of few variables in high dimensions," in Workshop on Approximation and Online Algorithms (WAOA), 2014, pp. 108–119.

[15] J.
Djolonga, A. Krause, and V. Cevher, "High-dimensional Gaussian process bandits," in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 1025–1033.