{"title": "Batched Multi-armed Bandits Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 503, "page_last": 513, "abstract": "In this paper, we study the multi-armed bandit problem in the batched setting where the employed policy must split data into a small number of batches. While the minimax regret for the two-armed stochastic bandits has been completely characterized in \\cite{perchet2016batched}, the effect of the number of arms on the regret for the multi-armed case is still open. Moreover, the question whether adaptively chosen batch sizes will help to reduce the regret also remains underexplored. In this paper, we propose the BaSE (batched successive elimination) policy to achieve the rate-optimal regrets (within logarithmic factors) for batched multi-armed bandits, with matching lower bounds even if the batch sizes are determined in an adaptive manner.", "full_text": "Batched Multi-armed Bandits Problem\n\nZijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou\n\nDepartment of {Statistics, Electrical Engineering, Statistics, Mathematics}\n\nStanford University\n\n{zijungao,yjhan,zren,zqzhou}@stanford.edu\n\nAbstract\n\nIn this paper, we study the multi-armed bandit problem in the batched setting\nwhere the employed policy must split data into a small number of batches. While\nthe minimax regret for the two-armed stochastic bandits has been completely\ncharacterized in [PRCS16], the effect of the number of arms on the regret for the\nmulti-armed case is still open. Moreover, the question whether adaptively chosen\nbatch sizes will help to reduce the regret also remains underexplored. 
In this paper, we propose the BaSE (batched successive elimination) policy to achieve the rate-optimal regrets (within logarithmic factors) for batched multi-armed bandits, with matching lower bounds even if the batch sizes are determined in an adaptive manner.

1 Introduction and Main Results

Batch learning and online learning are two important aspects of machine learning: in batch learning the learner is a passive observer of a given collection of data, while in online learning the learner can actively determine the data collection process. Recently, the combination of these learning procedures has arisen in an increasing number of applications, where active querying of data is possible but limited to a fixed number of rounds of interaction. For example, in clinical trials [Tho33, Rob52], data come in batches, where groups of patients are treated simultaneously and the outcomes are used to design the next trial. In crowdsourcing [KCS08], it takes the crowd some time to answer the current queries, so the total time constraint imposes restrictions on the number of rounds of interaction. Similar problems also arise in marketing [BM07] and simulations [CG09].

In this paper we study the influence of round constraints on the learning performance via the following batched multi-armed bandit problem. Let I = {1, 2, ···, K} be a given set of K ≥ 2 arms of a stochastic bandit, where successive pulls of an arm i ∈ I yield rewards that are i.i.d. samples from a distribution ν(i) with mean µ(i). Throughout this paper we assume that the reward follows a Gaussian distribution, i.e., ν(i) = N(µ(i), 1); generalizations to general sub-Gaussian rewards and variances are straightforward. Let µ⋆ = max_{i∈[K]} µ(i) be the expected reward of the best arm, and ∆_i = µ⋆ − µ(i) ≥ 0 be the gap between arm i and the best arm.
The entire time horizon T is split into M batches represented by a grid T = {t_1, ···, t_M} with 1 ≤ t_1 < t_2 < ··· < t_M = T, where the grid belongs to one of the following categories:

1. Static grid: the grid T = {t_1, ···, t_M} is fixed ahead of time, before sampling any arms;

2. Adaptive grid: for j ∈ [M], the grid value t_j may be determined after observing the rewards up to time t_{j−1} and using some external randomness.

Note that the adaptive grid is more powerful and practical than the static one, and that we recover batch learning and online learning by setting M = 1 and M = T, respectively. A sampling policy π = (π_t)_{t=1}^T is a sequence of random variables π_t ∈ [K] indicating which arm to pull at time t ∈ [T], where for t_{j−1} < t ≤ t_j the policy π_t depends only on observations up to time t_{j−1}. In other words, the policy π_t depends only on observations strictly anterior to the current batch containing t.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
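Concretely, the batch constraint says that rewards collected inside a batch are revealed to the policy only at the batch boundaries t_1 < ··· < t_M = T. The following minimal Python sketch makes this protocol explicit for a static grid; it is our own illustration, not code from the paper, and the helper names (`run_batched_policy`, `uniform`) are hypothetical.

```python
import random

def run_batched_policy(means, grid, choose_arms, seed=0):
    """Simulate a batched bandit with Gaussian N(mu, 1) rewards: rewards drawn
    within a batch are revealed to the policy only at the grid points."""
    rng = random.Random(seed)
    history = []          # (arm, reward) pairs revealed so far
    total_reward = 0.0
    t_prev = 0
    for t_j in grid:
        # The policy must commit to arms for the whole batch using only `history`.
        arms = choose_arms(history, t_j - t_prev, len(means))
        assert len(arms) == t_j - t_prev
        batch = [(a, rng.gauss(means[a], 1.0)) for a in arms]
        total_reward += sum(r for _, r in batch)
        history.extend(batch)  # observations revealed only at the batch boundary
        t_prev = t_j
    # expected-reward regret relative to always pulling the best arm
    return max(means) * grid[-1] - total_reward

def uniform(history, n, K):
    """Non-adaptive baseline: cycle through the arms (what M = 1 forces)."""
    return [i % K for i in range(n)]
```

Setting `grid = [T]` recovers batch learning (M = 1), while `grid = [1, 2, ..., T]` recovers fully online learning (M = T).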
The ultimate goal is to devise a sampling policy π to minimize the expected cumulative regret (or pseudo-regret, or simply regret), i.e., to minimize E[R_T(π)], where

R_T(π) ≜ Σ_{t=1}^T (µ⋆ − µ(π_t)) = T µ⋆ − Σ_{t=1}^T µ(π_t).

Let Π_{M,T} be the set of policies with M batches and time horizon T. Our objective is to characterize the following minimax regret and problem-dependent regret under the batched setting:

R⋆_min-max(K, M, T) ≜ inf_{π∈Π_{M,T}} sup_{{µ(i)}_{i=1}^K : ∆_i ≤ √K} E[R_T(π)],   (1)

R⋆_pro-dep(K, M, T) ≜ inf_{π∈Π_{M,T}} sup_{∆>0} ∆ · sup_{{µ(i)}_{i=1}^K : ∆_i ∈ {0}∪[∆, √K]} E[R_T(π)].   (2)

Note that the gaps between arms can be arbitrary in the definition of the minimax regret, while a lower bound on the minimum gaps is present in the problem-dependent regret. The constraint ∆_i ≤ √K is a technical condition in both scenarios, which is more relaxed than the usual condition ∆_i ∈ [0, 1]. These quantities are motivated by the fact that, when M = T, the upper bounds of the regret for multi-armed bandits usually take the form [Vog60, LR85, AB09, BPR13, PR13]

E[R_T(π_1)] ≤ C √(KT),
E[R_T(π_2)] ≤ C Σ_{i∈[K]: ∆_i>0} max{1, log(T ∆_i²)} / ∆_i,

where π_1, π_2 are some policies, and C > 0 is an absolute constant. These bounds are also tight in the minimax sense [LR85, AB09]. As a result, in the fully adaptive setting (i.e., when M = T), we have the optimal regrets R⋆_min-max(K, T, T) = Θ(√(KT)) and R⋆_pro-dep(K, T, T) = Θ(K log T). The target is to find the dependence of these quantities on the number of batches M.

Our first result tackles the upper bounds on the minimax regret and problem-dependent regret.

Theorem 1. For any K ≥ 2, T ≥ 1, 1 ≤ M ≤ T, there exist two policies π_1 and π_2 under static grids (explicitly defined in Section 2) such that if max_{i∈[K]} ∆_i ≤ √K, we have

E[R_T(π_1)] ≤ polylog(K, T) · √K T^{1/(2−2^{1−M})},
E[R_T(π_2)] ≤ polylog(K, T) · K T^{1/M} / min_{i≠⋆} ∆_i,

where polylog(K, T) hides poly-logarithmic factors in (K, T).

The following corollary is immediate.

Corollary 1. For the M-batched K-armed bandit problem with time horizon T, it is sufficient to have M = O(log log T) batches to achieve the optimal minimax regret Θ(√(KT)), and M = O(log T) batches to achieve the optimal problem-dependent regret Θ(K log T), where both optimal regrets are attained within logarithmic factors.

For the lower bounds of the regret, we treat the static grid and the adaptive grid separately. The next theorem presents the lower bounds under any static grid.

Theorem 2. For the M-batched K-armed bandit problem with time horizon T and any static grid, the minimax and problem-dependent regrets can be lower bounded as

R⋆_min-max(K, M, T) ≥ c · √K T^{1/(2−2^{1−M})},
R⋆_pro-dep(K, M, T) ≥ c · K T^{1/M},

where c > 0 is a numerical constant independent of K, M, T.

We observe that for any static grid, the lower bounds in Theorem 2 match the upper bounds in Theorem 1 within poly-logarithmic factors. For general adaptive grids, the following theorem shows regret lower bounds that are slightly weaker than those of Theorem 2.

Theorem 3.
For the M-batched K-armed bandit problem with time horizon T and any adaptive grid, the minimax and problem-dependent regrets can be lower bounded as

R⋆_min-max(K, M, T) ≥ c M^{−2} · √K T^{1/(2−2^{1−M})},
R⋆_pro-dep(K, M, T) ≥ c M^{−2} · K T^{1/M},

where c > 0 is a numerical constant independent of K, M, T.

Compared with Theorem 2, the lower bounds in Theorem 3 lose a polynomial factor in M due to a larger policy space. However, since the number of batches M of interest is at most O(log T) (otherwise, by Corollary 1, we effectively arrive at the fully adaptive case with M = T), this penalty is at most poly-logarithmic in T. Consequently, Theorem 3 shows that any adaptive grid, albeit conceptually more powerful, performs essentially no better than the best static grid. Specifically, we have the following corollary.

Corollary 2. For the M-batched K-armed bandit problem with time horizon T and either static or adaptive grids, it is necessary to have M = Ω(log log T) batches to achieve the optimal minimax regret Θ(√(KT)), and M = Ω(log T / log log T) batches to achieve the optimal problem-dependent regret Θ(K log T), where both optimal regrets are attained within logarithmic factors.

In summary, the above results completely characterize the minimax and problem-dependent regrets for batched multi-armed bandit problems, within logarithmic factors. It is an outstanding open question whether the M^{−2} term in Theorem 3 can be removed using more refined arguments.

1.1 Related works

The multi-armed bandit problem is an important class of sequential optimization problems which has been extensively studied in fields such as statistics, operations research, engineering, computer science and economics in recent years [BCB12].
In the fully adaptive scenario, the regret analysis for stochastic bandits can be found in [Vog60, LR85, BK97, ACBF02, AB09, AMS09, AB10, AO10, GC11, BPR13, PR13].

Less attention has been paid to the batched setting with limited rounds of interaction. The batched setting is studied in [CBDS13] under the name of switching costs, where it is shown that O(log log T) batches are sufficient to achieve the optimal minimax regret. For a small number of batches M, the batched two-armed bandit problem is studied in [PRCS16], where the results of Theorems 1 and 2 are obtained for K = 2. However, the generalization to the multi-armed case is not straightforward, and, more importantly, the practical scenario where the grid is adaptively chosen based on the historical data is excluded in [PRCS16]. For the multi-armed case, a different problem of finding the best k arms in the batched setting has been studied in [JJNZ16, AAAK17], where the goal is pure exploration and the error dependence on the time horizon decays super-polynomially. We also refer to [DRY18] for a similar setting with convex bandits and best arm identification.
The regret analysis for batched stochastic multi-armed bandits still remains underexplored.

We also review some literature on general computation with limited rounds of adaptivity, and in particular on the analysis of lower bounds. In theoretical computer science, this problem has been studied under the name of parallel algorithms for certain tasks (e.g., sorting and selection), given either deterministic [Val75, BT83, AA88] or noisy outcomes [FRPU94, DKMR14, BMW16]. In (stochastic) convex optimization, the information-theoretic limits are typically derived under the oracle model where the oracle can be queried adaptively [NY83, AWBR09, Sha13, DRY18]. However, in these previous works one usually optimizes the sampling distribution over a fixed sample size at each step, while it is more challenging to prove lower bounds for policies which can also determine the sample sizes. One exception is [AAAK17], whose proof relies on a complicated decomposition of near-uniform distributions. Hence, our technique for proving Theorem 3 is also expected to be an addition to this literature.

1.2 Organization

The rest of this paper is organized as follows. In Section 2, we introduce the BaSE policy for general batched multi-armed bandit problems, and show that it attains the upper bounds in Theorem 1 under two specific grids. Section 3 presents the proofs of lower bounds for both the minimax and problem-dependent regrets, where Section 3.1 deals with static grids and Section 3.2 tackles adaptive grids. Experimental results are presented in Section 4. The auxiliary lemmas and the proofs of the main lemmas are deferred to the supplementary materials.

1.3 Notations

For a positive integer n, let [n] ≜ {1, ···, n}. For any finite set A, let |A| be its cardinality.
We adopt the standard asymptotic notations: for two non-negative sequences {a_n} and {b_n}, let a_n = O(b_n) iff lim sup_{n→∞} a_n/b_n < ∞, a_n = Ω(b_n) iff b_n = O(a_n), and a_n = Θ(b_n) iff a_n = O(b_n) and b_n = O(a_n). For probability measures P and Q, let P ⊗ Q be the product measure with marginals P and Q. If the measures P and Q are defined on the same probability space, we denote by TV(P, Q) = (1/2) ∫ |dP − dQ| and D_KL(P‖Q) = ∫ dP log(dP/dQ) the total variation distance and the Kullback–Leibler (KL) divergence between P and Q, respectively.

2 The BaSE Policy

In this section, we propose the BaSE policy for the batched multi-armed bandit problem based on successive elimination, as well as two choices of static grids, to prove Theorem 1.

2.1 Description of the policy

The policy that achieves the optimal regrets is essentially adapted from Successive Elimination (SE). The original version of SE was introduced in [EDMM06], and [PR13] shows that in the M = T case SE achieves both the optimal minimax and problem-dependent rates. Here we introduce a batched version of SE, called Batched Successive Elimination (BaSE), to handle the general case M ≤ T. Given a pre-specified grid T = {t_1, ···, t_M}, the idea of the BaSE policy is simply to explore in the first M − 1 batches and then commit to the best arm in the last batch. At the end of each exploration batch, we remove arms that are probably bad based on past observations. Specifically, let A ⊆ I denote the set of active arms that are candidates for the optimal arm, where we initialize A = I and sequentially drop the arms that are "significantly" worse than the "best" one. For the first M − 1 batches, we pull all active arms the same number of times (neglecting rounding issues¹) and eliminate some arms from A at the end of each batch.
For the last batch, we commit to the arm in A with maximum average reward.

Before stating the exact algorithm, we introduce some notation. Let

Ȳ^i(t) = (1 / |{s ≤ t : arm i is pulled at time s}|) · Σ_{s=1}^t Y_s 1{arm i is pulled at time s}

denote the average reward of arm i up to time t, and let γ > 0 be a tuning parameter associated with the confidence bound used for elimination. The algorithm is described in detail in Algorithm 1.

Note that the BaSE algorithm is not fully specified unless the grid T is determined. Here we provide two choices of static grids, similar to [PRCS16], as follows: let

u_1 = a, u_m = a √(u_{m−1}), m = 2, ···, M,   t_m = ⌊u_m⌋, m ∈ [M],
u′_1 = b, u′_m = b u′_{m−1}, m = 2, ···, M,   t′_m = ⌊u′_m⌋, m ∈ [M],

where the parameters a, b are chosen appropriately such that t_M = t′_M = T, i.e.,

a = Θ(T^{1/(2−2^{1−M})}),   b = Θ(T^{1/M}).   (3)

For minimizing the minimax regret, we use the "minimax" grid defined by T_minimax = {t_1, ···, t_M}; as for the problem-dependent regret, we use the "geometric" grid defined by T_geometric = {t′_1, ···, t′_M}. We will denote by π¹_BaSE and π²_BaSE the respective policies under these grids.

¹There might be some rounding issues here, and some arms may be pulled once more than others. In this case, the additional pull will not be counted towards the computation of the average reward Ȳ^i(t), which ensures that all active arms are evaluated using the same number of pulls at the end of any batch. Note that in this way, the number of pulls for each arm is undercounted by at most a factor of two, so the regret analysis in Theorem 4 gives the same rate in the presence of rounding issues.

Algorithm 1: Batched Successive Elimination (BaSE)
Input: Arms I = [K]; time horizon T; number of batches M; grid T = {t_1, ..., t_M}; tuning parameter γ.
Initialization: A ← I.
for m ← 1 to M − 1 do
    (a) During the period [t_{m−1} + 1, t_m], pull each arm in A the same number of times.
    (b) At time t_m:
        Let Ȳ^max(t_m) = max_{j∈A} Ȳ^j(t_m), and let τ_m be the total number of pulls of each arm in A.
        for i ∈ A do
            if Ȳ^max(t_m) − Ȳ^i(t_m) ≥ √(γ log(TK)/τ_m) then
                A ← A − {i}.
            end
        end
end
for t ← t_{M−1} + 1 to T do
    pull arm i_0 such that i_0 ∈ arg max_{j∈A} Ȳ^j(t_{M−1}) (break ties arbitrarily).
end
Output: Resulting policy π.

2.2 Regret analysis

The performance of the BaSE policy is summarized in the following theorem.

Theorem 4. Consider an M-batched, K-armed bandit problem where the time horizon is T. Let π¹_BaSE be the BaSE policy equipped with the grid T_minimax and π²_BaSE be the BaSE policy equipped with the grid T_geometric. For γ ≥ 12 and max_{i∈[K]} ∆_i = O(√K), we have

E[R_T(π¹_BaSE)] ≤ C log K √(log(KT)) · √K T^{1/(2−2^{1−M})},   (4)
E[R_T(π²_BaSE)] ≤ C log K log(KT) · K T^{1/M} / min_{i≠⋆} ∆_i,   (5)

where C > 0 is a numerical constant independent of K, M and T.

Note that Theorem 4 implies Theorem 1.
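To make Algorithm 1 and the grids in (3) concrete, here is a short Python sketch of BaSE under the minimax and geometric grids. It is a simplified re-implementation under our own hypothetical naming (`minimax_grid`, `geometric_grid`, `base_policy`), not the authors' released code (which is linked in Section 4), and it handles rounding crudely; it assumes M ≥ 2 so that at least one exploration batch precedes the commitment batch.

```python
import math
import random

def minimax_grid(T, M):
    """Minimax grid of (3): u_1 = a, u_m = a*sqrt(u_{m-1}), t_m = floor(u_m),
    with a = T^{1/(2 - 2^{1-M})} so that t_M is (roughly) T."""
    a = T ** (1.0 / (2.0 - 2.0 ** (1 - M)))
    u, grid = a, []
    for _ in range(M):
        grid.append(min(T, math.floor(u)))
        u = a * math.sqrt(u)
    grid[-1] = T  # pin the last grid point exactly at the horizon
    return grid

def geometric_grid(T, M):
    """Geometric grid of (3): t_m = floor(b^m) with b = T^{1/M}."""
    b = T ** (1.0 / M)
    grid = [min(T, math.floor(b ** m)) for m in range(1, M + 1)]
    grid[-1] = T
    return grid

def base_policy(means, grid, gamma=1.0, seed=0):
    """BaSE: explore all active arms equally in the first M-1 batches,
    eliminate arms whose average trails the best by the threshold of
    Algorithm 1, then commit to the empirically best arm. Returns the
    realized pseudo-regret and the final active set."""
    rng = random.Random(seed)
    K, T = len(means), grid[-1]
    best_mean = max(means)
    active = list(range(K))
    sums, pulls = [0.0] * K, [0] * K
    regret, t_prev = 0.0, 0
    for t_m in grid[:-1]:
        per_arm = (t_m - t_prev) // len(active)
        for i in active:
            for _ in range(per_arm):
                sums[i] += rng.gauss(means[i], 1.0)
                pulls[i] += 1
            regret += per_arm * (best_mean - means[i])
        # leftover pulls from rounding go to an arbitrary active arm
        leftover = (t_m - t_prev) - per_arm * len(active)
        regret += leftover * (best_mean - means[active[0]])
        avg = {i: sums[i] / pulls[i] for i in active}
        best_avg = max(avg.values())
        thresh = math.sqrt(gamma * math.log(T * K) / pulls[active[0]])
        active = [i for i in active if best_avg - avg[i] < thresh]
        t_prev = t_m
    winner = max(active, key=lambda i: sums[i] / pulls[i])
    regret += (T - t_prev) * (best_mean - means[winner])
    return regret, active
```

For instance, `base_policy([0.6, 0.5, 0.5], geometric_grid(5 * 10**4, 3))` mirrors the default experimental setup of Section 4.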
In the sequel we sketch the proof of Theorem 4, where the main technical difficulty is to appropriately control the number of pulls of each arm under the batch constraints, since the number of active arms in A is random starting from the second batch. We also refer to a recent work [EKMM19] for a tighter bound on the problem-dependent regret with an adaptive grid.

Proof of Theorem 4. For notational simplicity we assume that there are K + 1 arms, where arm 0 is the arm with the highest expected reward (denoted as ⋆), and ∆_i = µ⋆ − µ_i ≥ 0 for i ∈ [K]. Define the following events: for i ∈ [K], let A_i be the event that arm i is eliminated before time t_{m_i}, where

m_i = min{ j ∈ [M] : arm i has been pulled at least τ⋆_i times before time t_j ∈ T },   τ⋆_i ≜ 4γ log(TK) / ∆_i²,

with the understanding that if the minimum does not exist, we set m_i = M and declare that the event A_i occurs. Let B be the event that arm ⋆ is not eliminated throughout the time horizon T. The final "good event" E is defined as E = (∩_{i=1}^K A_i) ∩ B. We remark that m_i is a random variable depending on the order in which the arms are eliminated. The following lemma shows that, by our choice of γ ≥ 12, the good event E occurs with high probability.

Lemma 1. The event E happens with probability at least 1 − 2/(TK).

The proof of Lemma 1 is postponed to the supplementary materials. By Lemma 1, the expected regret R_T(π) (with π = π¹_BaSE or π²_BaSE) when the event E does not occur is at most

E[R_T(π) 1(E^c)] ≤ T max_{i∈[K]} ∆_i · P(E^c) = O(1).   (6)

Next we condition on the event E and upper bound the regret E[R_T(π¹_BaSE) 1(E)] for the minimax grid T_minimax.
The analysis for the geometric grid T_geometric is entirely analogous and is deferred to the supplementary materials.

For the policy π¹_BaSE, let I_0 ⊆ I be the (random) set of arms which are eliminated at the end of the first batch, I_1 ⊆ I be the (random) set of remaining arms which are eliminated before the last batch, and I_2 = I − I_0 − I_1 be the (random) set of arms which remain in the last batch. It is clear that the total regret incurred by arms in I_0 is at most t_1 · max_{i∈[K]} ∆_i = O(√K a), and it remains to deal with the sets I_1 and I_2 separately.

For an arm i ∈ I_1, let σ_i be the (random) number of arms which are eliminated before arm i. Observe that the fraction of pulls of arm i is at most 1/(K − σ_i) before arm i is eliminated. Moreover, by the definition of t_{m_i}, we must have

τ⋆_i > (number of pulls of arm i before t_{m_i−1}) ≥ t_{m_i−1}/K   ⟹   ∆_i √(t_{m_i−1}) ≤ √(4γK log(TK)).

Hence, the total regret incurred by pulling an arm i ∈ I_1 is at most (note that t_j ≤ 2a √(t_{j−1}) for any j = 2, 3, ···, M by the choice of the grid)

∆_i · t_{m_i}/(K − σ_i) ≤ ∆_i · 2a √(t_{m_i−1})/(K − σ_i) ≤ 2a √(4γK log(TK)) / (K − σ_i).

Since there are at most t elements in (σ_i : i ∈ I_1) which are at least K − t for any t = 2, ···, K, the total regret incurred by pulling arms in I_1 is at most

Σ_{i∈I_1} 2a √(4γK log(TK)) / (K − σ_i) ≤ 2a √(4γK log(TK)) · Σ_{t=2}^K 1/t ≤ 2a log K √(4γK log(TK)).   (7)

For any arm i ∈ I_2, the previous analysis shows that ∆_i √(t_{M−1}) ≤ √(4γK log(TK)). Hence, letting T_i be the number of pulls of arm i, the total regret incurred by pulling arm i ∈ I_2 is at most

∆_i T_i ≤ T_i √(4γK log(TK) / t_{M−1}) ≤ (T_i / T) · 2a √(4γK log(TK)),

where in the last step we have used that T = t_M ≤ 2a √(t_{M−1}) in the minimax grid T_minimax. Since Σ_{i∈I_2} T_i ≤ T, the total regret incurred by pulling arms in I_2 is at most

Σ_{i∈I_2} (T_i / T) · 2a √(4γK log(TK)) ≤ 2a √(4γK log(TK)).   (8)

By (7) and (8), the inequality

R_T(π¹_BaSE) 1(E) ≤ 2a √(4γK log(TK)) (log K + 1) + O(√K a)

holds almost surely. Hence, this inequality combined with (6) and the choice of a in (3) yields the desired upper bound (4).

3 Lower Bound

This section presents lower bounds for the batched multi-armed bandit problem: in Section 3.1 we design a fixed multiple hypothesis testing problem to show the lower bound for any policy under a static grid, while in Section 3.2 we construct different hypotheses for different policies under general adaptive grids.

3.1 Static grid

The proof of Theorem 2 relies on the following lemma.

Lemma 2.
For any static grid 0 = t_0 < t_1 < ··· < t_M = T and any smallest gap ∆ ∈ (0, √K], the following minimax lower bound holds for any policy π under this grid:

sup_{{µ(i)}_{i=1}^K : ∆_i ∈ {0}∪[∆, √K]} E[R_T(π)] ≥ ∆ · Σ_{j=1}^M ((t_j − t_{j−1})/4) exp(−2 t_{j−1} ∆² / (K − 1)).

We first show that Lemma 2 implies Theorem 2 by choosing the smallest gap ∆ > 0 appropriately. By the definitions of the minimax regret R⋆_min-max and the problem-dependent regret R⋆_pro-dep, choosing ∆ = ∆_j = √((K − 1)/(t_{j−1} + 1)) ∈ [0, √K] in Lemma 2 yields that

R⋆_min-max(K, M, T) ≥ c_0 √K · max_{j∈[M]} t_j / √(t_{j−1} + 1),
R⋆_pro-dep(K, M, T) ≥ c_0 K · max_{j∈[M]} t_j / (t_{j−1} + 1),

for some numerical constant c_0 > 0. Since t_0 = 0 and t_M = T, the lower bounds in Theorem 2 follow.

Next we employ the general idea of multiple hypothesis testing to prove Lemma 2. Consider the following K candidate reward distributions:

P_1 = N(∆, 1) ⊗ N(0, 1) ⊗ N(0, 1) ⊗ ··· ⊗ N(0, 1),
P_2 = N(∆, 1) ⊗ N(2∆, 1) ⊗ N(0, 1) ⊗ ··· ⊗ N(0, 1),
P_3 = N(∆, 1) ⊗ N(0, 1) ⊗ N(2∆, 1) ⊗ ··· ⊗ N(0, 1),
...
P_K = N(∆, 1) ⊗ N(0, 1) ⊗ N(0, 1) ⊗ ··· ⊗ N(2∆, 1).

We remark that this construction is not entirely symmetric: the reward distribution of the first arm is always N(∆, 1). The key properties of this construction are as follows:

1. For any i ∈ [K], arm i is the optimal arm under reward distribution P_i;

2.
For any i ∈ [K], pulling a wrong arm incurs a regret of at least ∆ under reward distribution P_i.

As a result, since the average regret serves as a lower bound of the worst-case regret, we have

sup_{{µ(i)}_{i=1}^K : ∆_i ∈ {0}∪[∆, √K]} E[R_T(π)] ≥ (1/K) Σ_{i=1}^K Σ_{t=1}^T E_{P_i^t}[R_t(π)] ≥ ∆ Σ_{t=1}^T (1/K) Σ_{i=1}^K P_i^t(π_t ≠ i),   (9)

where P_i^t denotes the distribution of the observations available at time t under P_i, and R_t(π) denotes the instantaneous regret incurred by the policy π_t at time t. Hence, it remains to lower bound the quantity (1/K) Σ_{i=1}^K P_i^t(π_t ≠ i) for any t ∈ [T], which is the subject of the following lemma.

Lemma 3. Let Q_1, ···, Q_n be probability measures on some common probability space (Ω, F), and let Ψ : Ω → [n] be any measurable function (i.e., test). Then for any tree T = ([n], E) with vertex set [n] and edge set E, we have

(1/n) Σ_{i=1}^n Q_i(Ψ ≠ i) ≥ (1/(2n)) Σ_{(i,j)∈E} exp(−D_KL(Q_i‖Q_j)).

The proof of Lemma 3 is deferred to the supplementary materials, and we make some remarks below.

Remark 1. A more well-known lower bound for (1/n) Σ_{i=1}^n Q_i(Ψ ≠ i) is Fano's inequality [CT06], which involves the mutual information I(U; X) with U ∼ Uniform([n]) and P_{X|U=i} = Q_i. However, since I(U; X) = E_{P_U} D_KL(P_{X|U}‖P_X), Fano's inequality gives a lower bound which depends linearly rather than exponentially on the pairwise KL divergences, and is thus too loose for our purpose.

Remark 2. An alternative lower bound is (1/(2n²)) Σ_{i≠j} exp(−D_KL(Q_i‖Q_j)), i.e., with the summation taken over all pairs (i, j) instead of just the edges of a tree.
However, this bound is weaker than Lemma 3, and in the case where Q_i = N(i∆, 1) for some large ∆ > 0, Lemma 3 with the path tree T = ([n], {(1, 2), (2, 3), ···, (n − 1, n)}) is tight (giving the rate exp(−O(∆²))), while the alternative bound loses a factor of n (giving the rate exp(−O(∆²))/n).

To lower bound (9), we apply Lemma 3 with the star tree T = ([n], {(1, i) : 2 ≤ i ≤ n}). For i ∈ [K], denote by T_i(t) the number of pulls of arm i anterior to the current batch of t; hence, Σ_{i=1}^K T_i(t) = t_{j−1} if t ∈ (t_{j−1}, t_j]. Moreover, since D_KL(P_1^t‖P_i^t) = 2∆² E_{P_1^t}[T_i(t)], we have

(1/K) Σ_{i=1}^K P_i^t(π_t ≠ i) ≥ (1/(2K)) Σ_{i=2}^K exp(−D_KL(P_1^t‖P_i^t)) = (1/(2K)) Σ_{i=2}^K exp(−2∆² E_{P_1^t}[T_i(t)])
≥ ((K − 1)/(2K)) exp(−(2∆²/(K − 1)) Σ_{i=2}^K E_{P_1^t}[T_i(t)])   (by Jensen's inequality)
≥ (1/4) exp(−2∆² t_{j−1}/(K − 1)).   (10)

Now combining (9) and (10) completes the proof of Lemma 2.

3.2 Adaptive grid

Now we investigate the case where the grid may be randomized and generated sequentially in an adaptive manner. Recall that in the previous section, we constructed multiple fixed hypotheses and showed that no policy under a static grid can achieve a uniformly small regret under all of them. However, this argument breaks down even if the grid is only randomized but not adaptive, due to the non-convex (in (t_1, ···, t_M)) nature of the lower bound in Lemma 2. In other words, we cannot hope for a single fixed multiple hypothesis testing problem to work for all policies. To overcome this difficulty, a subroutine in the proof of Theorem 3 is to construct appropriate hypotheses after the policy is given (cf.
the proof of Lemma 4). We sketch the proof below.

We shall only prove the lower bound for the minimax regret; the analysis of the problem-dependent regret is entirely analogous. Consider the following times T_1, ···, T_M ∈ [1, T] and gaps ∆_1, ···, ∆_M ∈ (0, √K] with

T_j = ⌊T^{(1−2^{−j})/(1−2^{−M})}⌋,   ∆_j = (√K/(36M)) · T^{−(1−2^{1−j})/(2(1−2^{−M}))},   j ∈ [M].   (11)

Let T = {t_1, ···, t_M} be any adaptive grid, and let π be any policy under the grid T. For each j ∈ [M], we define the event A_j = {t_{j−1} < T_{j−1}, t_j ≥ T_j} under policy π, with the convention that t_0 = 0, t_M = T. Note that the events A_1, ···, A_M form a partition of the entire probability space. We also define the following family of reward distributions: for j ∈ [M − 1] and k ∈ [K − 1], let

P_{j,k} = N(0, 1) ⊗ ··· ⊗ N(0, 1) ⊗ N(∆_j + ∆_M, 1) ⊗ N(0, 1) ⊗ ··· ⊗ N(0, 1) ⊗ N(∆_M, 1),

where the k-th component of P_{j,k} has a non-zero mean. For j = M, we define

P_M = N(0, 1) ⊗ ··· ⊗ N(0, 1) ⊗ N(∆_M, 1).

Note that this construction ensures that P_{j,k} and P_M differ only in the k-th component, which is crucial for the indistinguishability results in Lemma 5.

We will be interested in the following quantities:

p_j = (1/(K − 1)) Σ_{k=1}^{K−1} P_{j,k}(A_j), j ∈ [M − 1],   p_M = P_M(A_M),

where P_{j,k}(A) denotes the probability of the event A given the true reward distribution P_{j,k} and the policy π. The importance of these quantities lies in the following lemmas.

Lemma 4.
If pj \u2265 1\n\n2M for some j \u2208 [M ], then we have\n\nE[RT (\u03c0)] \u2265 cM\u22122 \u00b7\n\nKT\n\n1\n\n2\u221221\u2212M ,\n\n\u221a\n\n{\u00b5(i)}K\n\nsup\ni=1:\u2206i\u2264\u221a\n\nK\n\nwhere c > 0 is a numerical constant independent of (K, M, T ) and (\u03c0,T ).\n\nLemma 5. The following inequality holds:(cid:80)M\n\nj=1 pj \u2265 1\n2 .\n\nThe detailed proofs of Lemma 4 and Lemma 5 are deferred to the supplementary materials, and we\nonly sketch the ideas here. Lemma 4 states that, if any of the events Aj occurs with a non-small\n\n8\n\n\fprobability in the respective j-th world (i.e., under the mixture of (Pj,k : k \u2208 [K \u2212 1]) or PM ), then\nthe policy \u03c0 has a large regret in the worst case. The intuition behind Lemma 4 is that, if the event\ntj\u22121 \u2264 Tj\u22121 occurs under the reward distribution Pj,k, then the observations in the \ufb01rst (j \u2212 1)\nbatches are not suf\ufb01cient to distinguish Pj,k from its (carefully designed) perturbed version with\nsize of perturbation \u2206j. Furthermore, if in addition tj \u2265 Tj holds, then the total regret is at least\n\u2126(Tj\u2206j) due to the indistinguishability of the \u2206j perturbations in the \ufb01rst j batches. Hence, if Aj\noccurs with a fairly large probability, the resulting total regret will be large as well.\nLemma 5 complements Lemma 4 by stating that at least one pj should be large. Note that if all pj\nj\u2208[M ] pj \u2265 1.\nSince the occurrence of Aj cannot really help to distinguish the j-th world with later ones, Lemma 5\nshows that we may still operate in the same world and arrive at a slightly smaller constant than 1.\nFinally we show how Lemma 4 and Lemma 5 imply Theorem 3. In fact, by Lemma 5, there exists\nsome j \u2208 [M ] such that pj \u2265 (2M )\u22121. 
Then by Lemma 4 and the arbitrariness of π, we arrive at the desired lower bound in Theorem 3.

4 Experiments

This section contains experimental results on the performance of the BaSE policy under different grids. The default parameters are T = 5 × 10^4, K = 3, M = 3 and γ = 1, and the mean reward is µ* = 0.6 for the optimal arm and µ = 0.5 for all other arms. In addition to the minimax and geometric grids, we also experiment with the arithmetic grid tj = jT/M for j ∈ [M]. Figure 1 (a)-(c) displays the empirical average regret of BaSE under different grids, together with a comparison against the centralized UCB1 algorithm [ACBF02] run without any batch constraints. We observe that the minimax grid typically results in the smallest regret among all grids, and that M = 4 batches appear to be sufficient for the BaSE performance to approach the centralized performance. We also compare our BaSE algorithm with the ETC algorithm of [PRCS16] in the two-armed case, and Figure 1 (d) shows that BaSE achieves lower regrets than ETC. The source code of the experiments can be found at https://github.com/Mathegineer/batched-bandit.

(a) Average regret vs. the number of batches M. (b) Average regret vs. the number of arms K.

(c) Average regret vs.
the time horizon T.

(d) Comparison of BaSE and ETC.

Figure 1: Empirical regret performances of the BaSE policy.

References

[AA88] Noga Alon and Yossi Azar. Sorting, approximate sorting, and searching in rounds. SIAM Journal on Discrete Mathematics, 1(3):269–280, 1988.

[AAAK17] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75, 2017.

[AB09] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

[AB10] Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.

[ACBF02] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[AMS09] Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.

[AO10] Peter Auer and Ronald Ortner. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem.
Periodica Mathematica Hungarica, 61(1-2):55–65, 2010.

[AWBR09] Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

[BCB12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.

[BK97] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.

[BM07] Dimitris Bertsimas and Adam J. Mersereau. A learning approach for interactive marketing to a customer segment. Operations Research, 55(6):1120–1135, 2007.

[BMW16] Mark Braverman, Jieming Mao, and S. Matthew Weinberg. Parallel algorithms for select and partition with noisy comparisons. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 851–862. ACM, 2016.

[BPR13] Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet. Bounded regret in stochastic multi-armed bandits. In Proceedings of the 26th Annual Conference on Learning Theory, pages 122–134, 2013.

[BT83] Béla Bollobás and Andrew Thomason. Parallel sorting. Discrete Applied Mathematics, 6(1):1–11, 1983.

[CBDS13] Nicolò Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160–1168, 2013.

[CG09] Stephen E. Chick and Noah Gans. Economic analysis of simulation selection problems. Management Science, 55(3):421–437, 2009.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory.
Wiley, New York, second edition, 2006.

[DKMR14] Susan Davidson, Sanjeev Khanna, Tova Milo, and Sudeepa Roy. Top-k and clustering with noisy comparisons. ACM Transactions on Database Systems (TODS), 39(4):35, 2014.

[DRY18] John Duchi, Feng Ruan, and Chulhee Yun. Minimax bounds on stochastic batched convex optimization. In Conference on Learning Theory, pages 3065–3162, 2018.

[EDMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.

[EKMM19] Hossein Esfandiari, Amin Karbasi, Abbas Mehrabian, and Vahab Mirrokni. Batched multi-armed bandits with optimal regret. arXiv preprint arXiv:1910.04959, 2019.

[FRPU94] Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with noisy information. SIAM Journal on Computing, 23(5):1001–1018, 1994.

[GC11] Aurélien Garivier and Olivier Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pages 359–376, 2011.

[JJNZ16] Kwang-Sung Jun, Kevin G. Jamieson, Robert D. Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pages 139–148, 2016.

[KCS08] Aniket Kittur, Ed H. Chi, and Bongwon Suh. Crowdsourcing user studies with Mechanical Turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 453–456. ACM, 2008.

[LR85] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[NY83] Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.

[PR13] Vianney Perchet and Philippe Rigollet.
The multi-armed bandit problem with covariates. The Annals of Statistics, pages 693–721, 2013.

[PRCS16] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, and Erik Snowberg. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.

[Rob52] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

[Sha13] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24, 2013.

[Tho33] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[Tsy08] A. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2008.

[Val75] Leslie G. Valiant. Parallelism in comparison problems. SIAM Journal on Computing, 4(3):348–355, 1975.

[Vog60] Walter Vogel. A sequential design for the two armed bandit. The Annals of Mathematical Statistics, 31(2):430–443, 1960.