{"title": "Censored Semi-Bandits: A Framework for Resource Allocation with Censored Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 14526, "page_last": 14536, "abstract": "In this paper, we study Censored Semi-Bandits, a novel variant of the semi-bandits problem. The learner is assumed to have a fixed amount of resources, which it allocates to the arms at each time step. The loss observed from an arm is random and depends on the amount of resources allocated to it. More specifically, the loss equals zero if the allocation for the arm exceeds a constant (but unknown) threshold that can be dependent on the arm. Our goal is to learn a feasible allocation that minimizes the expected loss. The problem is challenging because the loss distribution and threshold value of each arm are unknown. We study this novel setting by establishing its `equivalence' to Multiple-Play Multi-Armed Bandits (MP-MAB) and Combinatorial Semi-Bandits. Exploiting these equivalences, we derive optimal algorithms for our setting using existing algorithms for MP-MAB and Combinatorial Semi-Bandits. Experiments on synthetically generated data validate performance guarantees of the proposed algorithms.", "full_text": "Censored Semi-Bandits: A Framework for Resource\n\nAllocation with Censored Feedback\n\nArun Verma\n\nDepartment of IEOR\nIIT Bombay, India\n\nv.arun@iitb.ac.in\n\nArun Rajkumar\nDepartment of CSE\nIIT Madras, India\n\narunr@cse.iitm.ac.in\n\nManjesh K. Hanawal\nDepartment of IEOR\nIIT Bombay, India\n\nmhanwal@iitb.ac.in\n\nRaman Sankaran\n\nLinkedIn India\nBengaluru, India\n\nrsankara@linkedin.com\n\nAbstract\n\nIn this paper, we study Censored Semi-Bandits, a novel variant of the semi-bandits\nproblem. The learner is assumed to have a \ufb01xed amount of resources, which it\nallocates to the arms at each time step. The loss observed from an arm is random\nand depends on the amount of resources allocated to it. 
More speci\ufb01cally, the\nloss equals zero if the allocation for the arm exceeds a constant (but unknown)\nthreshold that can be dependent on the arm. Our goal is to learn a feasible allocation\nthat minimizes the expected loss. The problem is challenging because the loss\ndistribution and threshold value of each arm are unknown. We study this novel\nsetting by establishing its \u2018equivalence\u2019 to Multiple-Play Multi-Armed Bandits\n(MP-MAB) and Combinatorial Semi-Bandits. Exploiting these equivalences, we\nderive optimal algorithms for our setting using the existing algorithms for MP-MAB\nand Combinatorial Semi-Bandits. Experiments on synthetically generated data\nvalidate performance guarantees of the proposed algorithms.\n\n1\n\nIntroduction\n\nMany real-life sequential resource allocation problems have a censored feedback structure. Consider,\nfor instance, the problem of optimally allocating patrol of\ufb01cers (resources) across various locations\nin a city on a daily basis to combat opportunistic crimes. Here, a perpetrator picks a location (e.g., a\ndeserted street) and decides to commit a crime (e.g., mugging) but does not go ahead with it if a patrol\nof\ufb01cer happens to be around in the vicinity. Though the true potential crime rate depends on the latent\ndecision of the perpetrator, one observes feedback only when the crime is committed. Thus crimes\nthat were planned but not committed get censored. This model of censoring is quite general and \ufb01nds\napplications in several resource allocation problems such as police patrolling (Curtin et al., 2010),\ntraf\ufb01c regulations and enforcement (Adler et al., 2014; Rosenfeld and Kraus, 2017), poaching control\n(Nguyen et al., 2016; Gholami et al., 2018), supplier selection (Abernethy et al., 2016), advertisement\nbudget allocation (Lattimore et al., 2014), among many others.\nExisting approaches that deal with censored feedback in resource allocation problems fall into two\nbroad categories. 
Curtin et al. (2010); Adler et al. (2014); Rosenfeld and Kraus (2017) learn good resource allocations from historical data. Nguyen et al. (2016); Gholami et al. (2018); Zhang et al. (2016); Sinha et al. (2018) pose the problem in a game-theoretic framework (opportunistic security games) and propose algorithms for optimal resource allocation strategies. While the first approach fails to capture the sequential nature of the problem, the second approach is agnostic to the user (perpetrator) behavioral modeling. In this work, we balance these two approaches by proposing a simple yet novel threshold-based behavioral model, which we term Censored Semi-Bandits (CSB). The model captures how a user opportunistically reacts to an allocation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In the first variation of our proposed behavioral model, we assume the threshold (the user-behavior parameter) is uniform across arms (locations). We establish that this setup is 'equivalent' to Multiple-Play Multi-Armed Bandits (MP-MAB), where a fixed number of arms is played in each round. We also study the more general variation where the threshold is arm dependent. We establish that this setup is equivalent to Combinatorial Semi-Bandits, where the subset of arms to be played is decided by solving a combinatorial 0-1 knapsack problem.

Formally, we tackle the sequential nature of the resource allocation problem by establishing its equivalence to the MP-MAB and Combinatorial Semi-Bandits frameworks. By exploiting this equivalence for our proposed threshold-based behavioral model, we develop novel resource allocation algorithms by adapting existing algorithms, and we provide optimal regret guarantees for the same.

Related Work: The problem of resource allocation for tackling crimes has received significant interest in recent times. Curtin et al. 
(2010) employ a static maximum coverage strategy for spatial police allocation, while Nguyen et al. (2016) and Gholami et al. (2018) study game-theoretic and adversarial perpetrator strategies. We, on the other hand, restrict ourselves to a non-adversarial setting. Adler et al. (2014) and Rosenfeld and Kraus (2017) look at traffic police resource deployment and consider the optimization aspects of the problem using real-time traffic data, etc., which differs from the main focus of our work. Zhang et al. (2015) investigate dynamic resource allocation in the context of police patrolling and poaching for opportunistic criminals; they attempt to learn a model of criminals using a dynamic Bayesian network. Our approach proposes a simpler and more realistic model of perpetrators whose underlying structure can be exploited efficiently.

We pose our problem in the exploration-exploitation paradigm, which involves solving the MP-MAB and combinatorial 0-1 knapsack problems. It is different from the bandits with Knapsacks setting studied in Badanidiyuru et al. (2018), where resources get consumed in every round. The works of Abernethy et al. (2016) and Jain and Jamieson (2018) are similar to ours in that they also consider threshold-based settings. However, the thresholding we employ naturally fits our problem and significantly differs from theirs. Specifically, their thresholding is on a sample generated from an underlying distribution, whereas we work in a Bernoulli setting where the thresholding is based on the allocation. Resource allocation with semi-bandit feedback (Lattimore et al., 2014, 2015; Dagan and Crammer, 2018) studies a related but less general setup where the reward is based only on the allocation and a hidden threshold. 
Our setting requires an additional unknown parameter for each arm, a 'mean loss,' which also affects the reward.

Allocation problems in the combinatorial setting have been explored in Cesa-Bianchi and Lugosi (2012); Chen et al. (2013); Rajkumar and Agarwal (2014); Combes et al. (2015); Chen et al. (2016); Wang and Chen (2018). Even though these are not directly related to our setting, we derive explicit connections from a sub-problem of our algorithm to the setups of Komiyama et al. (2015) and Wang and Chen (2018).

2 Problem Setting

We consider a sequential learning problem where K denotes the number of arms (locations), and Q denotes the amount of divisible resources. The loss at arm i ∈ [K], where [K] := {1, 2, . . . , K}, is Bernoulli distributed with rate µi ∈ (0, 1]. Each arm can be assigned a fraction of resource that decides the feedback observed and the loss incurred from that arm: if the allocated resource is smaller than a certain threshold¹, the loss incurred is the realization of the arm, and it is observed. Otherwise, the realization is unobserved, and the incurred loss is zero. Let a := {ai : i ∈ [K]}, where ai ∈ [0, 1] denotes the resource allocated to arm i. For each i ∈ [K], let θi ∈ (0, 1] denote the threshold associated with arm i, which is such that a loss is incurred at arm i only if ai < θi. An allocation vector a is said to be feasible if Σ_{i∈[K]} ai ≤ Q, and the set of all feasible allocations is denoted as AQ. The goal is to find a feasible resource allocation that results in a maximum reduction in the mean loss incurred.

¹One could consider a smooth function instead of a step function, but the analysis is more involved, and our results need not generalize straightforwardly.

In our setup, resources may be allocated to multiple arms. 
However, the loss from each of the allocated arms may not be observed, depending on the amount of resources allocated to them. We thus have a version of the partial monitoring system (Cesa-Bianchi et al., 2006; Bartók and Szepesvári, 2012; Bartók et al., 2014) with semi-bandit feedback, and we refer to it as censored semi-bandits (CSB). The vectors θ := {θi}i∈[K] and µ := {µi}i∈[K] are unknown and identify an instance of the CSB problem. Henceforth we identify a CSB instance as P = (µ, θ, Q) ∈ [0, 1]^K × (0, 1]^K × R+ and denote the collection of CSB instances as PCSB. As the number of arms is K = |µ|, K is known (implicitly) from an instance of CSB. For simplicity of discussion, we assume that µ1 ≤ µ2 ≤ . . . ≤ µK, but the algorithms are not aware of this ordering. For an instance P ∈ PCSB, the optimal allocation can be computed by the following 0-1 knapsack problem:

a⋆ ∈ arg min_{a∈AQ} Σ_{i=1}^{K} µi 1{ai < θi}.    (1)

The interaction between the environment and a learner is given in Algorithm 1.

Algorithm 1 CSB Learning Protocol on instance (µ, θ, Q)
For each round t:
1. Environment generates a vector Xt = (Xt,1, Xt,2, . . . , Xt,K) ∈ {0, 1}^K, where E[Xt,i] = µi and the sequence (Xt,i)t≥1 is i.i.d. for all i ∈ [K].
2. Learner picks an allocation vector at ∈ AQ.
3. Feedback and Loss: The learner observes a random feedback Yt = {Yt,i : i ∈ [K]}, where Yt,i = Xt,i 1{at,i < θi}, and incurs loss Σ_{i∈[K]} Yt,i.

The goal of the learner is to find a feasible resource allocation strategy at every round such that the cumulative loss is minimized. 
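As a concrete illustration of the protocol in Algorithm 1, the following minimal Python sketch simulates one round of the environment-learner interaction. The function name `csb_round` and the use of Python's `random` module are our own illustrative choices, not part of the paper:

```python
import random

def csb_round(mu, theta, a, rng):
    """One round of the CSB protocol (illustrative sketch).

    mu[i]    : Bernoulli loss rate of arm i (unknown to the learner)
    theta[i] : censoring threshold of arm i (unknown to the learner)
    a[i]     : resource the learner allocates to arm i
    Returns the feedback vector Y_t and the incurred loss sum(Y_t),
    where Y_{t,i} = X_{t,i} * 1{a_i < theta_i}.
    """
    x = [1 if rng.random() < m else 0 for m in mu]  # latent realizations X_t
    y = [xi if ai < ti else 0 for xi, ai, ti in zip(x, a, theta)]
    return y, sum(y)

rng = random.Random(0)
mu, theta = [0.9, 0.9, 0.9], [0.5, 0.5, 0.5]
# Arm 0 receives at least theta_0 resource, so its loss is always censored.
y, loss = csb_round(mu, theta, [0.6, 0.0, 0.0], rng)
assert y[0] == 0 and loss == y[1] + y[2]
```

Note that when ai ≥ θi the arm's realization is censored: the learner neither observes Xt,i nor incurs its loss, which is exactly what makes the feedback partial.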
Specifically, we measure the performance of a policy that selects allocations {at}t≥1 over T steps in terms of the expected (pseudo) regret given by

E[RT] = E[ Σ_{t=1}^{T} Σ_{i=1}^{K} Xt,i 1{at,i < θi} ] − T Σ_{i=1}^{K} µi 1{a⋆i < θi}.    (2)

A good policy should have sub-linear expected regret, i.e., E[RT]/T → 0 as T → ∞.

3 Identical Threshold for All Arms

In this section, we focus on the particular case of the CSB problem where θi = θc for all i ∈ [K]. With abuse of notation, we continue to denote an instance of CSB with the same threshold as (µ, θc, Q), where θc ∈ (0, 1] is the common threshold.

Definition 1. For a given loss vector µ and amount of resources Q, we say that thresholds θc and θ̂c are allocation equivalent if the following holds:

min_{a∈AQ} Σ_{i=1}^{K} µi 1{ai < θc} = min_{a∈AQ} Σ_{i=1}^{K} µi 1{ai < θ̂c}.

Though θc can take any value in the interval (0, 1], an allocation equivalent to it can be confined to a finite set. The following lemma shows that the search for an allocation equivalent can be restricted to ⌈K − Q + 1⌉ elements.

Lemma 1. For any θc ∈ (0, 1] and Q ≥ θc, let M = min{⌊Q/θc⌋, K} and θ̂c = Q/M. Then θc and θ̂c are allocation equivalent. Further, θ̂c ∈ Θ, where Θ = {Q/K, Q/(K − 1), . . . , min{1, Q}}.

Let M = min{⌊Q/θc⌋, K}. In the following, when arms are sorted in increasing order of mean losses, we refer to the last M arms as the bottom-M arms and the remaining arms as top-(K − M) arms. 
It is easy to note that an optimal allocation under the same threshold θc is to allocate θc amount of resource to each of the bottom-M arms and allocate the remaining resources to the other arms. Lemma 1 shows that the set of candidate allocation-equivalent thresholds θ̂c for any instance (µ, θc, Q) is finite. Once this value is found, the problem reduces to identifying the bottom-M arms and assigning resource θ̂c to each one of them to minimize the mean loss. The latter part is equivalent to solving a Multiple-Play Multi-Armed Bandits problem, as discussed next.

3.1 Equivalence to Multiple-play Multi-armed Bandits

In stochastic Multiple-Play Multi-Armed Bandits (MP-MAB), a learner can play a subset of arms in each round, known as a superarm (Anantharam et al., 1987). The size of each superarm is fixed (and known). The mean loss of a superarm is the sum of the means of its constituent arms. In each round, the learner plays a superarm and observes the loss from each arm played (semi-bandit feedback). The goal of the learner is to select a superarm that has the smallest mean loss. A policy in MP-MAB selects a superarm in each round based on the past information. Its performance is measured in terms of regret, defined as the difference between the cumulative loss incurred by the policy and that incurred by playing an optimal superarm in each round. Let (µ, m) ∈ [0, 1]^K × N+ denote an instance of MP-MAB, where µ denotes the mean loss vector and m ≤ K denotes the size of each superarm.

Let PcCSB ⊂ PCSB denote the set of CSB instances with the same threshold for all arms. For any (µ, θc, Q) ∈ PcCSB with K arms and known threshold θc, let (µ, m) be an instance of MP-MAB with K arms, where each arm has the same Bernoulli distribution as the corresponding arm in the CSB instance and m = K − M, with M = min{⌊Q/θc⌋, K} as earlier. 
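The mapping of Lemma 1 is simple enough to state as code. The sketch below (function name our own) computes M = min{⌊Q/θc⌋, K} and the allocation-equivalent threshold θ̂c = Q/M:

```python
import math

def allocation_equivalent_threshold(theta_c, Q, K):
    """Lemma 1 (sketch): map theta_c to its allocation-equivalent threshold
    Q/M, where M = min(floor(Q/theta_c), K) arms can each receive that much
    resource. The returned value always lies in the finite set
    Theta = {Q/K, Q/(K-1), ..., min(1, Q)}."""
    M = min(math.floor(Q / theta_c), K)
    return Q / M, M

# With Q = 7 and K = 20: theta_c = 0.5 supports M = 14 arms and is already
# of the form Q/M, so it maps to itself.
theta_hat, M = allocation_equivalent_threshold(0.5, 7, 20)
assert (theta_hat, M) == (0.5, 14)

# theta_c = 0.45 supports M = 15 arms; its allocation-equivalent threshold
# is 7/15, an element of Theta.
theta_hat2, M2 = allocation_equivalent_threshold(0.45, 7, 20)
assert M2 == 15 and theta_hat2 == 7 / 15
```

The point of the lemma is that θ̂c ≥ θc covers exactly as many arms as θc does, so the learner never needs to know θc beyond its bracketing element of Θ.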
Let PMP denote the set of resulting MP-MAB problems and f : PcCSB → PMP denote the above transformation.

Let π be a policy on PMP. π can also be applied to any (µ, θc, Q) ∈ PcCSB with known θc to decide which set of arms is allocated resources, as follows. In round t, let (L1, Y1, L2, Y2, . . . , Lt−1, Yt−1) be the information collected on the CSB instance, where Ls is the set of K − M arms to which no resource is applied in round s and Ys is the samples observed from these arms. In round t, this information is given to π, which returns a set Lt with K − M elements. Then all arms other than those in Lt are given resource θc. Let this policy on (µ, θc, Q) ∈ PcCSB be denoted as π′. In a similar way, a policy β on PcCSB can be adapted to yield a policy for PMP as follows. In round t, the information (S1, Y1, S2, Y2, . . . , St−1, Yt−1) collected on the MP-MAB instance, where Ss is the superarm played in round s and Ys is the associated loss observed from each arm in Ss, is given to β, which returns a set St of K − M arms to which no resources are to be applied. The superarm corresponding to St is then played. Let this policy on PMP be denoted as β′. Note that when θc is known, the mapping is invertible. The next proposition gives the regret equivalence between the MP-MAB problem and the CSB problem with a known, identical threshold.

Proposition 1. Let P := (µ, θc, Q) ∈ PcCSB with known θc. Then the regret of π′ on P is the same as the regret of π on f(P). Similarly, let P′ := (µ, m) ∈ PMP. Then the regret of β′ on P′ is the same as the regret of β on f−1(P′). 
Thus the set PcCSB with a known θc is 'regret equivalent' to PMP, i.e., R(PcCSB) = R(PMP).

Lower bound: As a consequence of the above equivalence and the one-to-one correspondence between MP-MAB and CSB with the same known threshold, a lower bound on MP-MAB is also a lower bound on CSB with the same threshold. Hence the following lower bound, which holds for any strongly consistent algorithm (Anantharam et al., 1987, Theorem 3.1), is also a lower bound on the CSB problem with the same threshold:

lim_{T→∞} E[RT] / log T ≥ Σ_{i∈[K]\[K−M]} (µi − µK−M) / d(µK−M, µi),    (3)

where d(p, q) is the Kullback-Leibler (KL) divergence between two Bernoulli distributions with parameters p and q. Also note that we are in the loss setting.

The above proposition suggests that any algorithm which works well for MP-MAB also works well for CSB once the threshold is known. Hence one can use algorithms like MP-TS (Komiyama et al., 2015) and ESCB (Combes et al., 2015) once an allocation equivalent of θc is found. MP-TS uses Thompson Sampling, whereas ESCB uses UCB (Upper Confidence Bound) and KL-UCB type indices. One could use any of these algorithms, but we adapt MP-TS to our setting as it gives better empirical performance and is shown to achieve the optimal regret bound for Bernoulli distributions.

3.2 Algorithm: CSB-ST

We develop an algorithm named CSB-ST for solving the Censored Semi-Bandits problem with the Same Threshold. It exploits the result in Lemma 1 and the equivalence established in Proposition 1 to learn a good estimate of the threshold and to minimize the regret using an MP-MAB algorithm. 
CSB-ST consists of two phases, namely, threshold estimation and regret minimization.

CSB-ST Algorithm for solving the Censored Semi-Bandits problem with the Same Threshold

Input: K, Q, δ, ϵ
\\ Threshold Estimation Phase \\
1: Initialize C = 0, l = 0, u = |Θ|, i = ⌈u/2⌉
2: Set Θ as in Lemma 1, Tθs = 0, Wδ = log(log2(|Θ|)/δ)/(max{1, ⌊Q⌋} log(1/(1 − ϵ)))
3: while i ≠ u do
4:   Set θ̂c = Θ[i]
5:   At ← first Q/θ̂c arms. Allocate θ̂c resource to each arm i ∈ At
6:   If loss observed for any arm i ∈ At then set l = i, i = l + ⌈(u − l)/2⌉, C = 0 else C = C + 1
7:   If C = Wδ then set u = i, i = u − ⌊(u − l)/2⌋, C = 0
8:   Tθs = Tθs + 1
9: end while
\\ Regret Minimization Phase \\
10: Set M = Q/θ̂c and ∀i ∈ [K] : Si = 1, Fi = 1
11: for t = Tθs + 1, Tθs + 2, . . . , T do
12:   ∀i ∈ [K] : µ̂i(t) ∼ β(Si, Fi)
13:   Lt ← (K − M) arms with smallest estimates
14:   ∀i ∈ Lt : allocate no resource to arm i. ∀j ∈ [K] \ Lt : allocate θ̂c resources to arm j
15:   ∀i ∈ Lt : observe Xt,i. Update Si = Si + Xt,i and Fi = Fi + 1 − Xt,i
16: end for

Threshold Estimation Phase: This phase finds a threshold θ̂c that is allocation equivalent to the underlying threshold θc with high probability by doing a binary search over the set Θ = {Q/K, Q/(K − 1), . . . , min{1, Q}}. The elements of Θ are arranged in increasing order and are candidates for θc. The search starts by taking θ̂c to be the middle element of Θ and allocating θ̂c resource to the first Q/θ̂c arms (denoted as At in Line 5). 
If a loss is observed at any of these arms, it implies that θ̂c is an underestimate of θc. All the candidates lower than the current value of θ̂c in Θ are eliminated, and the search is repeated in the remaining half of the elements, again starting with the middle element (Line 6). If no loss is observed for Wδ consecutive rounds, then θ̂c is possibly an overestimate. Accordingly, all the candidates larger than the current value of θ̂c in Θ are eliminated, and the search is repeated starting with the middle element of the remaining half (Line 7). The variable C keeps track of the number of consecutive rounds for which no loss is observed. It changes to 0 either after a loss is observed or after no loss is observed for Wδ consecutive rounds.

Note that if θ̂c is an underestimate and no loss happens to be observed for Wδ consecutive rounds, then θ̂c will be reduced, which leads to a wrong estimate of θ̂c. To avoid this, we set the value of Wδ such that the probability of such an event is upper bounded by δ. The next lemma gives a bound on the number of rounds needed to find an allocation equivalent of the threshold θc with high probability.

Lemma 2. Let (µ, θc, Q) be a CSB instance such that µ1 ≥ ϵ > 0. Then with probability at least 1 − δ, the number of rounds needed by the threshold estimation phase of CSB-ST to find an allocation equivalent of the threshold θc is bounded as

Tθs ≤ [log(log2(|Θ|)/δ) / (max{1, ⌊Q⌋} log(1/(1 − ϵ)))] · log2(|Θ|).

Once θ̂c is known, µ needs to be estimated. The resources can be allocated such that losses are censored for at most M arms. As our goal is to minimize the mean loss, we have to select the M arms with the highest mean losses and allocate θ̂c to each of them. 
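Under the simplifying assumption that every under-allocated arm incurs a loss in each round (µi = 1, so a single round of loss-free feedback replaces the Wδ-round wait of the real algorithm), the binary search of the threshold estimation phase can be sketched as follows; the function name is our own:

```python
def estimate_threshold(theta_c, Q, K):
    """Deterministic sketch of CSB-ST's threshold-estimation phase.

    Assumes mu_i = 1 for all arms, so a loss is observed in a round exactly
    when the candidate threshold under-allocates (cand < theta_c); the real
    algorithm instead waits W_delta loss-free rounds before concluding that
    a candidate is large enough."""
    # Theta = {Q/K, Q/(K-1), ..., min(1, Q)} in increasing order (Lemma 1).
    Theta = [Q / j for j in range(K, 0, -1) if Q / j <= min(1.0, Q)]
    lo, hi = 0, len(Theta) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        cand = Theta[mid]
        m = int(Q // cand)                   # arms probed with cand resource
        loss_seen = m >= 1 and cand < theta_c
        if loss_seen:
            lo = mid + 1                     # cand underestimates theta_c
        else:
            hi = mid                         # cand suffices; try smaller
    return Theta[lo]

# Instance 1 from the experiments section: K = 20, Q = 7, theta_c = 0.7.
assert abs(estimate_threshold(0.7, 7, 20) - 0.7) < 1e-9   # 0.7 = Q/10 is in Theta
assert abs(estimate_threshold(0.65, 7, 20) - 0.7) < 1e-9  # maps up to Q/10
```

The search returns the smallest element of Θ that never triggers a loss, which by Lemma 1 is the allocation equivalent Q/M of the true θc.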
Equivalently, we can find the K − M arms with the least mean losses, allocate no resources to these arms, and observe their losses. These losses are then used to update the empirical estimates of the mean losses of the arms.

Regret Minimization Phase: The regret minimization phase of CSB-ST adapts Multiple-Play Thompson Sampling (MP-TS) (Komiyama et al., 2015) to our setting. It works as follows: initially, we set the prior distribution of each arm to the Beta distribution β(1, 1), which is the same as the Uniform distribution on [0, 1]. Si represents the number of rounds in which a loss is observed, whereas Fi represents the number of rounds in which no loss is observed. Let Si(t) and Fi(t) denote the values of Si and Fi at the beginning of round t. In round t, a sample µ̂i is independently drawn from β(Si(t), Fi(t)) for each arm i ∈ [K]. The µ̂i values are ranked in increasing order. The first K − M arms are assigned no resources (denoted as the set Lt in Line 13), while each of the remaining M arms is allocated θ̂c resources. The loss Xt,i is observed for each arm i ∈ Lt, and the success and failure counts are then updated by setting Si = Si + Xt,i and Fi = Fi + 1 − Xt,i.

3.2.1 Regret Upper Bound

For an instance (µ, θ, Q) and any feasible allocation a ∈ AQ, we define ∇a = Σ_{i=1}^{K} µi (1{ai < θi} − 1{a⋆i < θi}) and ∇m = max_{a∈AQ} ∇a. We are now ready to state the regret bounds.

Theorem 1. Let µ1 ≥ ϵ > 0, Wδ = log(log2(|Θ|)/δ)/(max{1, ⌊Q⌋} log(1/(1 − ϵ))), µK−M < µK−M+1, and T > Wδ log2(|Θ|). Set δ = T^(−(log T)^(−α)) in CSB-ST for some α > 0. 
Then the regret of CSB-ST is upper bounded as

E[RT] ≤ Wδ log2(|Θ|) ∇m + Σ_{i∈[K]\[K−M]} (µi − µK−M) log T / d(µK−M, µi) + O((log T)^(2/3)).

The first term in the regret bound of Theorem 1 corresponds to the length of the threshold estimation phase, and the remaining terms correspond to the expected regret in the regret minimization phase. Note that the assumption µ1 ≥ ϵ is only required to guarantee that the threshold estimation phase terminates in a finite number of rounds; it is not needed for the bound on the expected regret in the regret minimization phase. The assumption µK−M < µK−M+1 ensures that the Kullback-Leibler divergence in the bound is well defined. This assumption is also equivalent to assuming that the set of bottom-M arms is unique.

Corollary 1. The regret of CSB-ST is asymptotically optimal.

Note that Wδ = O((log T)^(1−α)) for any α > 0 with δ = T^(−(log T)^(−α)) in CSB-ST. The proof of Corollary 1 then follows by comparing the above bound with the lower bound given in Eq. (3).

4 Different Thresholds

In this section, we consider a more difficult problem where the threshold may not be the same for all arms. Let KP(µ, θ, Q) denote a 0-1 knapsack problem with capacity Q and K items, where item i has weight θi and value µi. Our next result gives the optimal allocation for an instance in PCSB.

Proposition 2. Let P = (µ, θ, Q) ∈ PCSB. Then the optimal allocation for P is a solution of the KP(µ, θ, Q) problem.

The proof of the above proposition and the computational issues of the 0-1 knapsack with fractional values are given in the supplementary material. We next discuss the condition under which two threshold vectors are allocation equivalent. 
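Proposition 2 reduces the optimal allocation to a 0-1 knapsack: choose which arms to saturate (allocate θi) so that the mean loss averted is maximized under the budget Q. For tiny K this can be illustrated by brute force; the sketch below uses our own names, and a real implementation would use a knapsack routine as discussed in the supplementary material:

```python
from itertools import combinations

def optimal_allocation(mu, theta, Q):
    """Brute-force KP(mu, theta, Q): enumerate subsets of arms to saturate
    and keep the feasible one averting the most mean loss. Exponential in
    K, so for illustration only."""
    K = len(mu)
    best_set, best_value = (), 0.0
    for r in range(K + 1):
        for S in combinations(range(K), r):
            if sum(theta[i] for i in S) <= Q:
                value = sum(mu[i] for i in S)      # mean loss averted by S
                if value > best_value:
                    best_set, best_value = S, value
    residual_loss = sum(mu) - best_value           # mean loss still incurred
    return set(best_set), residual_loss

# Two cheap-to-cover arms beat one expensive arm under budget Q = 1.0.
S, loss = optimal_allocation([0.5, 0.4, 0.45], [1.0, 0.5, 0.5], 1.0)
assert S == {1, 2} and abs(loss - 0.5) < 1e-9
```

The example shows why the problem is genuinely combinatorial: the highest-loss arm is not necessarily in the optimal saturated set once thresholds (weights) differ.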
Extending the definition of allocation equivalence to threshold vectors, we say that two vectors θ and θ̂ are allocation equivalent if the minimum mean loss in the instances (µ, θ, Q) and (µ, θ̂, Q) is the same for any loss vector µ and resource Q. This equivalence allows us to look for estimated thresholds within some tolerance. We need the following notation to formalize this notion. For an instance P := (µ, θ, Q), recall that a⋆ = (a⋆1, . . . , a⋆K) denotes the optimal allocation. Let r = Q − Σ_{i : a⋆i ≥ θi} θi, where r is the residual resource after the optimal allocation. Define γ := r/K. Any problem instance with γ = 0 becomes a 'hopeless' problem instance, as the only vector that is allocation equivalent to θ is θ itself, which demands that the θi values be estimated with full precision to achieve an optimal allocation. However, if γ > 0, an optimal allocation can still be found with small errors in the estimates of θi, as shown next.

Lemma 3. Let γ > 0 and ∀i ∈ [K] : θ̂i ∈ [θi, θi + γ]. Then for any µ ∈ [0, 1]^K and Q, θ and θ̂ are allocation equivalent.

The proof follows by an application of Theorem 3.2 in Hifi and Mhalla (2013), which gives conditions for two weight vectors θ1 and θ2 to have the same solution in KP(µ, θ1, Q) and KP(µ, θ2, Q) for any µ and Q. Once we estimate the threshold θ accurately enough that the estimate θ̂ is an allocation equivalent of θ, the problem is equivalent to solving KP(µ, θ̂, Q). The latter part is equivalent to solving a Combinatorial Semi-Bandits problem, as we establish next. 
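Lemma 3 can be checked numerically on a small instance of our own construction, using brute-force enumeration in place of a knapsack solver: perturbing each θi upward by at most γ = r/K leaves the minimum mean loss unchanged.

```python
from itertools import combinations

def min_loss(mu, theta, Q):
    """Minimum achievable mean loss: brute-force over which arms to
    saturate with theta_i resources (illustration only, exponential in K)."""
    best = sum(mu)
    for r in range(len(mu) + 1):
        for S in combinations(range(len(mu)), r):
            if sum(theta[i] for i in S) <= Q:
                best = min(best, sum(mu) - sum(mu[i] for i in S))
    return best

mu    = [0.2, 0.6, 0.9]
theta = [0.4, 0.4, 0.4]
Q     = 1.0
# The optimal allocation saturates arms 1 and 2, so the residual resource
# is r = 1.0 - 0.8 = 0.2 and gamma = r / K = 0.2 / 3 > 0.
gamma = 0.2 / 3
theta_hat = [t + gamma for t in theta]   # each theta_hat_i in [theta_i, theta_i + gamma]
assert abs(min_loss(mu, theta, Q) - 0.2) < 1e-9
assert abs(min_loss(mu, theta_hat, Q) - 0.2) < 1e-9
```

The check illustrates the slack argument behind the lemma: because γK = r units of resource are left over at the optimum, inflating every weight by at most γ cannot push the optimal subset over the budget.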
Combinatorial Semi-Bandits are a generalization of MP-MAB in which the size of the superarm need not be the same in each round.

Proposition 3. The CSB problem with known threshold vector θ is regret equivalent to a Combinatorial Semi-Bandits problem where the Oracle uses KP(µ, θ, Q) to identify the optimal subset of arms.

4.1 Algorithm: CSB-DT

We develop an algorithm named CSB-DT for solving the Censored Semi-Bandits problem with Different Thresholds. It exploits the result of Lemma 3 and the equivalence established in Proposition 3 to learn a good estimate of the threshold of each arm, and minimizes the regret using an algorithm for Combinatorial Semi-Bandits. CSB-DT also consists of two phases: threshold estimation and regret minimization.

CSB-DT Algorithm for solving the Censored Semi-Bandits problem with Different Thresholds

Input: K, Q, δ, ϵ, γ
\\ Threshold Estimation Phase \\
1: Initialize ∀i ∈ [K] : θl,i = 0, θu,i = 1, θg,i = 0, Ci = 0
2: Set Tθd = 0, Wδ = log(K log2(⌈1 + 1/γ⌉)/δ)/log(1/(1 − ϵ))
3: ∀i ∈ [⌊Q⌋] : allocate θ̂i = 0.5 resource. ∀j ∈ [⌊Q⌋ + 1, K] : allocate θ̂j = (Q − ⌊Q⌋/2)/(K − ⌊Q⌋) resources
4: while θg,i = 0 for any i ∈ [K] do
5:   for i = 1, . . . , K do
6:     if loss observed for arm i with θg,i = 0 and θl,i < θ̂i then
7:       Set θl,i = θ̂i, θ̂i = (θu,i + θl,i)/2, Ci = 0. If available, allocate resource θ̂i
8:     else
9:       If the allocated resource is θ̂i then set Ci = Ci + 1
10:      if Ci = Wδ and θg,i = 0 then
11:        Set θu,i = θ̂i, θ̂i = (θu,i + θl,i)/2, Ci = 0. If available, allocate resource θ̂i
12:        If θu,i − θl,i ≤ γ then set θg,i = 1 and θ̂i = θu,i
13:      end if
14:    end if
15:  end for
16:  while free resources are available do
17:    Allocate θ̂i resources to a new randomly chosen arm i from the arms having θg,i = 1
18:  end while
19:  Tθd = Tθd + 1
20: end while
\\ Regret Minimization Phase \\
21: ∀i ∈ [K] : Si = 1, Fi = 1
22: for t = Tθd + 1, Tθd + 2, . . . , T do
23:   ∀i ∈ [K] : µ̂i(t) ← Beta(Si, Fi)
24:   Lt ← Oracle(KP(µ̂(t), θ̂, Q))
25:   ∀i ∈ Lt : allocate no resource to arm i. ∀j ∈ [K] \ Lt : allocate θ̂j resources to arm j
26:   ∀i ∈ Lt : observe Xt,i. Update Si = Si + Xt,i and Fi = Fi + 1 − Xt,i
27: end for

Threshold Estimation Phase: This phase finds a threshold vector that is allocation equivalent to θ with high probability. This is achieved by ensuring that θ̂i ∈ [θi, θi + γ] for all i (Lemma 3). For each arm i ∈ [K], a binary search is performed over the interval [0, 1] by maintaining the variables θ̂i, θl,i, θu,i, θg,i, and Ci, where θ̂i is the current estimate of θi; θl,i and θu,i denote the lower and upper ends of the binary search region for arm i; and θg,i indicates whether the current estimate lies in the interval [θi, θi + γ]. In each round, the threshold estimates of the arms are first updated sequentially and then tested on their respective arms. Ci counts the consecutive rounds in which no loss is observed under the estimate θ̂i. 
It changes to 0 either after a loss is observed or after no loss is observed for Wδ consecutive rounds.

The threshold estimation phase starts by allocating 0.5 resource to each of the first ⌊Q⌋ arms and (Q − ⌊Q⌋/2)/(K − ⌊Q⌋) to each of the remaining arms (Line 3). In each round, the allocated resources are applied to the arms and, based on the observations, the estimates and allocated resources are updated sequentially from arm 1 to arm K as follows. If a loss is observed for an arm i with a bad threshold estimate (θg,i = 0) and θl,i < θ̂i, then θ̂i is an underestimate of θi, and the following actions are performed: 1) the lower end of the search region is increased to θ̂i, i.e., θl,i = θ̂i; 2) the estimate θ̂i is set to (θu,i + θl,i)/2; 3) if resources are available, θ̂i resource is allocated to arm i; and 4) Ci is set to 0 (Line 7). If no loss is observed after allocating θ̂i resources for Wδ successive rounds to an arm i with a bad threshold estimate, then θ̂i is an overestimate, and the following actions are performed: 1) the upper end of the search region is decreased to θ̂i, i.e., θu,i = θ̂i; 2) the estimate θ̂i is set to (θu,i + θl,i)/2; and 3) if resources are available, θ̂i resource is allocated to arm i (Line 11). Further, whether θ̂i is a good estimate, i.e., θ̂i ∈ [θi, θi + γ], is checked via the condition θu,i − θl,i ≤ γ. If the condition holds, the threshold estimate of the arm is within the desired accuracy, which is indicated by setting θg,i to 1 and θ̂i = θu,i (Line 12). 
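The per-arm update just described (Lines 7, 11, and 12) can be sketched as a small state machine. The dictionary keys, function name, and driver loop below are our own illustrative simplification, run with deterministic losses (µi = 1) and Wδ = 1; the real algorithm interleaves these updates across all K arms with stochastic feedback:

```python
def update_arm(state, loss_observed, W, gamma):
    """One CSB-DT binary-search update for a single arm (sketch).

    state keys lo, hi, est, count, done mirror theta_l, theta_u,
    theta_hat, C_i, theta_g for that arm."""
    if state["done"]:
        return state
    if loss_observed and state["lo"] < state["est"]:
        # est underestimates theta_i: raise the lower end of the window.
        state.update(lo=state["est"], count=0)
        state["est"] = (state["lo"] + state["hi"]) / 2
    else:
        state["count"] += 1
        if state["count"] == W:
            # W consecutive loss-free rounds: est overestimates theta_i.
            state.update(hi=state["est"], count=0)
            state["est"] = (state["lo"] + state["hi"]) / 2
            if state["hi"] - state["lo"] <= gamma:
                state.update(done=True, est=state["hi"])
    return state

# Drive the search against a fixed true threshold with deterministic losses
# (mu_i = 1, W = 1) and the tolerance gamma = 1e-3 used in Instance 2.
theta_true = 0.3
arm = {"lo": 0.0, "hi": 1.0, "est": 0.5, "count": 0, "done": False}
for _ in range(100):
    if arm["done"]:
        break
    arm = update_arm(arm, arm["est"] < theta_true, W=1, gamma=1e-3)
assert arm["done"] and theta_true <= arm["est"] <= theta_true + 1e-3
```

The final estimate lands in [θi, θi + γ] as Lemma 3 requires: lower ends always come from rounds with losses (below θi) and upper ends from loss-free stretches (at or above θi).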
Any unassigned resources are given to randomly chosen arms with good threshold estimates (arms with θg,i = 1), where each such arm i receives only ˆθi resources (Line 17).
The value of Wδ in CSB-DT is set such that the probability that the estimated threshold lies outside [θi, θi + γ] for some arm is upper bounded by δ. The following lemma bounds the number of rounds needed to find an allocation equivalent of the threshold vector θ with high probability.
Lemma 4. Let (µ, θ, Q) be an instance of CSB such that γ > 0 and µ1 ≥ ε > 0. Then, with probability at least 1 − δ, the number of rounds needed by the threshold estimation phase of CSB-DT to find an allocation equivalent of the threshold vector θ is bounded as

Tθd ≤ (K log(K log2(⌈1 + 1/γ⌉)/δ) / (max{1, ⌊Q⌋} log(1/(1 − ε)))) · log2(⌈1 + 1/γ⌉).

Regret Minimization Phase: For this phase, we can use any algorithm that works well for Combinatorial Semi-Bandits, such as SDCB (Chen et al., 2016) or CTS (Wang and Chen, 2018). CTS uses Thompson Sampling, whereas SDCB uses a UCB-type index. We adapt CTS to our setting because of its better empirical performance. This phase is similar to the regret minimization phase of CSB-ST, except that the super-arm to play is selected by an Oracle that uses KP(ˆµ(t), ˆθ, Q) to identify the arms to which the learner allocates no resources.

4.1.1 Regret Upper Bound
Let ∇a and ∇m be defined as in Section 3.2.1. Let γ > 0, Sa = {i : ai < θi} for any feasible allocation a, and k⋆ = |Sa⋆|. We redefine Wδ = log(K log2(⌈1 + 1/γ⌉)/δ)/ log(1/(1 − ε)).
Theorem 2. Let (µ, θ, Q) ∈ PCSB such that γ > 0, µ1 ≥ ε, and T > Wδ log2(⌈1 + 1/γ⌉).
Set δ = T^(−(log T)^(−α)) in CSB-DT for some α > 0. Then the expected regret of CSB-DT is upper bounded as

E[RT] ≤ (KWδ log2(⌈1 + 1/γ⌉)/max{1, ⌊Q⌋}) ∇m + ∑i∈[K] max{Sa : i∈Sa} 8|Sa| log T / (∇a − 2((k⋆)² + 2)η) + (K(K − Q)²/η² + (8α1/η²)(4/η² + 1)^k⋆ log(k⋆/η²) + 3K) ∇m,

for any η such that ∀a ∈ AQ : ∇a > 2((k⋆)² + 2)η, and a problem-independent constant α1.
The first term of the expected regret is due to the threshold estimation phase. Threshold estimation takes Tθd rounds to complete, and ∇m is the maximum regret that can be incurred in any round, so the maximum regret due to threshold estimation is bounded by Tθd ∇m. The remaining terms correspond to the regret incurred in the regret minimization phase. Further, the expected regret of CSB-DT can be shown to be E[RT] ≤ O(K log T /∇min), where ∇min is the minimum gap between the mean loss of the optimal allocation and that of any non-optimal allocation.

5 Experiments

We ran computer simulations to evaluate the empirical performance of the proposed algorithms. Our simulations involve two synthetically generated instances. In Instance 1, the threshold is the same for all arms, whereas in Instance 2, it varies across arms. The details of the instances are as follows:
Instance 1 (Identical Threshold): It has K = 20, Q = 7, θc = 0.7, δ = 0.1, ε = 0.1, and T = 10000. The loss of arm i is Bernoulli distributed with parameter 0.25 + (i − 1)/50.
Instance 2 (Different Thresholds): It has K = 5, Q = 2, δ = 0.1, ε = 0.1, γ = 10⁻³, and T = 5000.
The mean loss vector is µ = [0.9, 0.89, 0.87, 0.58, 0.3], and the corresponding threshold vector is θ = [0.7, 0.7, 0.7, 0.6, 0.35]. The loss of arm i is Bernoulli distributed with parameter µi.
For Instance 1, we first varied only the amount of resources Q and observed the regret of CSB-ST, as shown in Fig. 1. When resources are scarce, the learner can allocate resources to only a few arms but observes losses from many arms. When resources are plentiful, the learner allocates resources to many arms and observes losses from only a few arms. Thus, as resources increase, we move from semi-bandit feedback toward bandit feedback, and hence the regret grows with the amount of resources. Next, we varied only θc in Instance 1; the regret of CSB-ST is shown in Fig. 2. A similar trend is observed: a lower threshold increases the number of arms that can be allocated resources, and vice versa. Therefore, the amount of feedback decreases as the threshold decreases, leading to more regret. We repeated each experiment 100 times and plotted the regret with a 95% confidence interval (the vertical line on each curve shows the confidence interval). The empirical results validate the sub-linear regret bounds of our algorithms.

Figure 1: Varying amount of resources. Figure 2: Varying value of the same threshold. Figure 3: UCB and TS based algorithms.

We also compare the performance of CSB-DT against the CSB-DT-UCB algorithm, which uses a UCB-type index as in the SDCB algorithm (Chen et al., 2016), on Instance 2. As shown in Fig. 3 and as expected, the Thompson Sampling (TS) based CSB-DT outperforms its UCB-based counterpart CSB-DT-UCB. The pseudo-code of CSB-DT-UCB is given in the supplementary material.

6 Conclusion and Future Extensions

In this work, we proposed a novel framework for resource allocation problems using a variant of semi-bandits named censored semi-bandits.
In our setup, the loss observed from an arm depends on the amount of resources allocated to it, and hence the loss can be censored. We consider a threshold-based model in which a loss from an arm is observed only when the allocated resource is below a threshold. The goal is to allocate the given resources to the arms so that the total expected loss is minimized. We considered two variants of the problem, depending on whether or not the thresholds are the same across the arms. For the variant where the thresholds are the same across the arms, we established that it is equivalent to the Multiple-Play Multi-Armed Bandit problem. For the second variant, where the threshold can depend on the arm, we established that it is equivalent to the more general Combinatorial Semi-Bandit problem. Exploiting these equivalences, we developed algorithms that enjoy optimal performance guarantees.
We decoupled the problems of threshold estimation and mean loss estimation. It would be interesting to explore whether they can be estimated jointly, leading to better performance guarantees. Another extension of this work is to relax the assumptions that the mean losses are strictly positive and that the time horizon T is known.

Acknowledgments

Arun Verma would like to thank Google and NeurIPS for travel support. Manjesh K. Hanawal would like to acknowledge support from an INSPIRE faculty fellowship from DST, Government of India, a SEED grant (16IRCCSG010) from IIT Bombay, and an Early Career Research (ECR) Award from SERB. Initial discussions of this work took place when Raman Sankaran was at Conduent Labs India.

References

Kevin M Curtin, Karen Hayslett-McCall, and Fang Qiu. Determining optimal police patrol areas with maximal covering and backup covering location models. Networks and Spatial Economics, 10(1):125–145, 2010.

Nicole Adler, Alfred Shalom Hakkert, Jonathan Kornbluth, Tal Raviv, and Mali Sher. Location-allocation models for traffic police patrol vehicles on an interurban network.
Annals of Operations Research, 221(1):9–31, 2014.

Ariel Rosenfeld and Sarit Kraus. When security games hit traffic: Optimal traffic enforcement under one sided uncertainty. In IJCAI, pages 3814–3822, 2017.

Thanh H Nguyen, Arunesh Sinha, Shahrzad Gholami, Andrew Plumptre, Lucas Joppa, Milind Tambe, Margaret Driciru, Fred Wanyama, Aggrey Rwetsiba, Rob Critchlow, and Colin M. Beale. Capture: A new predictive anti-poaching tool for wildlife protection. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 767–775. International Foundation for Autonomous Agents and Multiagent Systems, 2016.

Shahrzad Gholami, Sara Mc Carthy, Bistra Dilkina, Andrew Plumptre, Milind Tambe, Margaret Driciru, Fred Wanyama, Aggrey Rwetsiba, Mustapha Nsubaga, Joshua Mabonga, Tom Okello, and Eric Enyel. Adversary models account for imperfect crime data: Forecasting and planning against real-world poachers. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 823–831. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

Jacob D Abernethy, Kareem Amin, and Ruihao Zhu. Threshold bandits, with and without censored feedback. In Advances in Neural Information Processing Systems, pages 4889–4897, 2016.

Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Optimal resource allocation with semi-bandit feedback. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 477–486. AUAI Press, 2014.

Chao Zhang, Victor Bucarey, Ayan Mukhopadhyay, Arunesh Sinha, Yundi Qian, Yevgeniy Vorobeychik, and Milind Tambe. Using abstractions to solve opportunistic crime security games at scale.
In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 196–204, 2016.

Arunesh Sinha, Fei Fang, Bo An, Christopher Kiekintveld, and Milind Tambe. Stackelberg security games: Looking beyond a decade of success. In IJCAI, pages 5494–5501, 2018.

Chao Zhang, Arunesh Sinha, and Milind Tambe. Keeping pace with criminals: Designing patrol allocation against adaptive opportunistic criminals. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1351–1359, 2015.

Ashwinkumar Badanidiyuru, Robert Kleinberg, and Aleksandrs Slivkins. Bandits with knapsacks. Journal of the ACM (JACM), 65(3):13, 2018.

Lalit Jain and Kevin Jamieson. Firing bandits: Optimizing crowdfunding. In International Conference on Machine Learning, pages 2211–2219, 2018.

Tor Lattimore, Koby Crammer, and Csaba Szepesvári. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pages 964–972, 2015.

Yuval Dagan and Koby Crammer. A better resource allocation algorithm with semi-bandit feedback. In Algorithmic Learning Theory, pages 268–320, 2018.

Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.

Arun Rajkumar and Shivani Agarwal. Online decision-making in general combinatorial spaces. In Advances in Neural Information Processing Systems, pages 3482–3490, 2014.

Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, and Marc Lelarge. Combinatorial bandits revisited.
In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.

Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions. In Advances in Neural Information Processing Systems, pages 1659–1667, 2016.

Siwei Wang and Wei Chen. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pages 5101–5109, 2018.

Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pages 1152–1161, 2015.

Nicolo Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, 31(3):562–580, 2006.

Gábor Bartók and Csaba Szepesvári. Partial monitoring with side information. In International Conference on Algorithmic Learning Theory, pages 305–319. Springer, 2012.

Gábor Bartók, Dean P Foster, Dávid Pál, Alexander Rakhlin, and Csaba Szepesvári. Partial monitoring – classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014.

Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: I.I.D. rewards. IEEE Transactions on Automatic Control, 32(11):968–976, 1987.

Mhand Hifi and Hedi Mhalla. Sensitivity analysis to perturbations of the weight of a subset of items: The knapsack case study.
Discrete Optimization, 10(4):320–330, 2013.