{"title": "Community Exploration: From Offline Optimization to Online Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5474, "page_last": 5483, "abstract": "We introduce the community exploration problem, which has various real-world applications such as online advertising. In the problem, an explorer allocates a limited budget to explore communities so as to maximize the number of members he could meet. We provide a systematic study of the community exploration problem, from offline optimization to online learning. For the offline setting where the sizes of communities are known, we prove that the greedy methods for both non-adaptive and adaptive exploration are optimal. For the online setting where the sizes of communities are not known and need to be learned from multi-round explorations, we propose an ``upper confidence''-like algorithm that achieves logarithmic regret bounds. By combining the feedback from different rounds, we can achieve a constant regret bound.", "full_text": "Community Exploration: From Offline Optimization to Online Learning

Xiaowei Chen1, Weiran Huang2, Wei Chen3, John C.S. Lui1
1The Chinese University of Hong Kong
2Huawei Noah's Ark Lab, 3Microsoft Research
1{xwchen, cslui}@cse.cuhk.edu.hk, 2huang.inbox@outlook.com
3weic@microsoft.com

Abstract

We introduce the community exploration problem, which has many real-world applications such as online advertising. In the problem, an explorer allocates a limited budget to explore communities so as to maximize the number of members he could meet. We provide a systematic study of the community exploration problem, from offline optimization to online learning. For the offline setting where the sizes of communities are known, we prove that the greedy methods for both non-adaptive and adaptive exploration are optimal.
For the online setting where the sizes of communities are not known and need to be learned from multi-round explorations, we propose an “upper confidence”-like algorithm that achieves logarithmic regret bounds. By combining the feedback from different rounds, we can achieve a constant regret bound.

1 Introduction

In this paper, we introduce the community exploration problem, which is abstracted from many real-world applications. Consider the following hypothetical scenario. Suppose that John has just entered the university as a freshman. He wants to explore different student communities or study groups at the university to meet as many new friends as possible. But he only has limited time to spend on exploring different communities, so his problem is how to allocate his time and energy among different student communities to maximize the number of people he would meet.

The above hypothetical community exploration scenario also has counterparts in serious business and social applications. One example is online advertising. In this application, an advertiser wants to promote his products by placing advertisements on different online websites. Each website shows the advertisements on its webpages, and visitors to the website may click on the advertisements when they view these webpages. The advertiser wants to reach as many unique customers as possible, but he only has a limited budget to spend. Moreover, website visitors arrive randomly, so it is not guaranteed that all visitors to the same website are distinct customers. The advertiser therefore needs to decide how to spend the budget on each website to reach his customers. Intuitively, he should spend more budget on larger communities, but how much more? And what if he does not know the user size of every website?
In this case, each website is a community, consisting of all visitors to this website, and the problem can be modeled as a community exploration problem. Another example is a social worker who wants to reach a large number of people from different communities to do social studies or improve social welfare for a large population, while also facing a budget constraint and uncertainty about the communities.

In this paper, we abstract the common features of these applications and define the following community exploration problem that reflects their common core. We model the problem with m disjoint communities C1, . . . , Cm with C = ∪_{i=1}^m Ci, where each community Ci has di members.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Each time one explores (or visits) a community Ci, he meets one member of Ci uniformly at random.¹ Given a budget K, the goal of community exploration is to determine the budget allocation k = (k1, . . . , km) ∈ Z_+^m with Σ_{i=1}^m ki ≤ K, such that the total number of distinct members met is maximized when each community Ci is explored ki times.

We provide a systematic study of the above community exploration problem, from offline optimization to online learning. First, we consider the offline setting where the community sizes are known. In this setting, we further study two problem variants — the non-adaptive version and the adaptive version. The non-adaptive version requires that the complete budget allocation k be decided before the exploration starts, while the adaptive version allows the algorithm to use the feedback from the exploration results of the previous steps to determine the exploration target of the next step. In both cases, we prove that the greedy algorithm provides the optimal solution.
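As a concrete illustration, the exploration process and its reward can be simulated directly. The sketch below is illustrative only (function and variable names are our own); the closed-form expectation it computes is the one derived later in Section 3.1.

```python
import random

def explore(sizes, alloc, rng=None):
    """Simulate one realization: each visit to community C_i meets a member of
    C_i uniformly at random; the reward R(k, phi) counts distinct members met."""
    rng = rng or random.Random(0)
    return sum(len({rng.randrange(d) for _ in range(k)})
               for d, k in zip(sizes, alloc))

def expected_reward(sizes, alloc):
    """Closed-form expected reward (Section 3.1): sum_i d_i (1 - (1 - 1/d_i)^k_i)."""
    return sum(d * (1 - (1 - 1 / d) ** k) for d, k in zip(sizes, alloc))
```

Averaging `explore` over many independent realizations approaches `expected_reward` for the same allocation.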
While the proof for the non-adaptive case is simple, the proof that the adaptive greedy policy is optimal is much more involved and relies on a careful analysis of transitions between system statuses. The proof techniques may be applicable to the analysis of other related problems.

Second, we consider the online setting where the community sizes are unknown in advance, which models the uncertainty about the communities in real applications. We apply the multi-armed bandit (MAB) framework to this task, in which community explorations proceed in multiple rounds; in each round we explore communities with a budget of K, use the feedback to learn about the community sizes, and adjust the exploration strategy in future rounds. The reward of a round is the expected number of unique people met in the round. The goal is to maximize the cumulative reward over all rounds, or equivalently to minimize the regret, which is defined as the difference in cumulative reward between always using the optimal offline algorithm that knows the community sizes and using the online learning algorithm. As in the offline case, we consider both the non-adaptive and the adaptive version of exploration within each round. We provide theoretical regret bounds of O(log T) for both versions, where T is the number of rounds, which is asymptotically tight.
Our analysis uses the special structure of the community exploration problem, which leads to improved coefficients in the regret bounds compared with a direct application of existing results on combinatorial MABs. Moreover, we also discuss the possibility of using the feedback from previous rounds to turn the problem into the full-information feedback model, which allows us to achieve constant regret in this case.

In summary, our contributions include: (a) proposing the study of the community exploration problem to reflect the core of a number of real-world applications; and (b) a systematic study of the problem with rigorous theoretical analysis that covers the offline non-adaptive, offline adaptive, online non-adaptive, and online adaptive cases, which model the real-world situations of adapting to feedback and handling uncertainty.

2 Problem Definition

We model the problem with m disjoint communities C1, . . . , Cm with C = ∪_{i=1}^m Ci, where each community Ci has di members. Each exploration (or visit) of a community Ci returns a member of Ci uniformly at random, and we have a total budget of K for explorations. Since we can trivially explore each community once when K ≤ m, we assume that K > m.

We consider both the offline setting where the sizes of the communities d1, . . . , dm are known, and the online setting where the sizes of the communities are unknown. For the offline setting, we further consider two different problems: (1) non-adaptive exploration and (2) adaptive exploration. For non-adaptive exploration, the explorer needs to predetermine the budget allocation k before the exploration starts, while for adaptive exploration, she can sequentially select the next community to explore based on previous observations (the members met in the previous community visits).

Formally, we use the pair (i, τ) to represent the τ-th exploration of community Ci, called an item.
Let E = [m] × [K] be the set of all possible items. A realization is a function φ : E → C mapping every possible item (i, τ) to a member of the corresponding community Ci, and φ(i, τ) represents the member met in the exploration (i, τ). We use Φ to denote a random realization, where the randomness comes from the exploration results. From the description above, Φ follows the distribution such that Φ(i, τ) ∈ Ci is selected uniformly at random from Ci and is independent of all other Φ(i′, τ′)'s.

1The model can be extended to meeting multiple members per visit, but for simplicity, we consider meeting one member per visit in this paper.

For a budget allocation k = (k1, . . . , km) and a realization φ, we define the reward R as the number of distinct members met, i.e., R(k, φ) = Σ_{i=1}^m |∪_{τ=1}^{ki} {φ(i, τ)}|, where |·| is the cardinality of a set. The goal of non-adaptive exploration is to find an optimal budget allocation k* = (k*_1, . . . , k*_m) with given budget K, which maximizes the expected reward taken over all possible realizations, i.e.,

k* ∈ arg max_{k : ‖k‖_1 ≤ K} E_Φ [R(k, Φ)] .    (1)

For adaptive exploration, the explorer sequentially picks a community to explore, meets a random member of the chosen community, then picks the next community, meets another random member of that community, and so on, until the budget is used up. After each selection, the observations so far can be represented as a partial realization ψ, a function from a subset of E to C = ∪_{i=1}^m Ci. Suppose that each community Ci has been explored ki times. Then the partial realization ψ is a function mapping the items in ∪_{i=1}^m {(i, 1), . . .
, (i, ki)} (also called the domain of ψ, denoted dom(ψ)) to members of the communities. The partial realization ψ records the observations on the sequence of explored communities and the members met in this sequence. We say that a partial realization ψ is consistent with a realization φ, denoted φ ∼ ψ, if for every item (i, τ) in the domain of ψ we have ψ(i, τ) = φ(i, τ). The strategy to explore the communities adaptively is encoded as a policy. A policy, denoted π, is a function mapping ψ to an item in E, specifying which community to explore next under the partial realization. Define π_K(φ) = (k1, . . . , km), where ki is the number of times community Ci is explored by policy π under realization φ with budget K. More specifically, starting from the partial realization ψ0 with empty domain, for the current partial realization ψs at step s, policy π determines the next community π(ψs) to explore and meets the member φ(π(ψs)), so that the new partial realization ψ_{s+1} adds the mapping from π(ψs) to φ(π(ψs)) on top of ψs. This iteration continues until the communities have been explored K times in total, and π_K(φ) = (k1, . . . , km) denotes the resulting exploration vector. The goal of adaptive exploration is to find an optimal policy π* that maximizes the expected adaptive reward, i.e.,

π* ∈ arg max_π E_Φ [R(π_K(Φ), Φ)] .    (2)

We next consider the online setting of community exploration. The learning process proceeds in discrete rounds. Initially, the sizes of the communities d = (d1, . . . , dm) are unknown.
In each round t ≥ 1, the learner needs to determine an allocation or a policy (called an “action”) based on the observations from previous rounds to explore the communities (non-adaptively or adaptively). When an action is played, the sets of encountered members for all communities are observed as the feedback to the player. A learning algorithm A aims to accumulate as much reward (i.e., number of distinct members) as possible by selecting actions properly in each round. The performance of a learning algorithm is measured by the cumulative regret. Let Φt be the realization at round t. If we explore the communities with a predetermined budget allocation in each round, the T-round (non-adaptive) regret of a learning algorithm A is defined as

Reg^A_μ(T) = E_{Φ1,...,ΦT} [ Σ_{t=1}^T ( R(k*, Φt) − R(k^A_t, Φt) ) ],    (3)

where the budget allocation k^A_t is selected by algorithm A in round t. If we explore the communities adaptively in each round, then the T-round (adaptive) regret of a learning algorithm A is defined as

Reg^A_μ(T) = E_{Φ1,...,ΦT} [ Σ_{t=1}^T ( R(π*_K(Φt), Φt) − R(π^{A,t}_K(Φt), Φt) ) ],    (4)

where π^{A,t} is the policy selected by algorithm A in round t. The goal of the learning problem is to design a learning algorithm A that minimizes the regret defined in (3) and (4).

3 Offline Optimization for Community Exploration

3.1 Non-adaptive Exploration Algorithms

If Ci is explored ki times, each member of Ci is encountered at least once with probability 1 − (1 − 1/di)^{ki}. Thus we have E_Φ[|{Φ(i, 1), . . . , Φ(i, ki)}|] = di(1 − (1 − 1/di)^{ki}). Hence E_Φ [R(k, Φ)] is a function of only the budget allocation k and the sizes d = (d1, . . .
, dm) of all communities.

Algorithm 1 Non-adaptive community exploration with optimal budget allocation
1: procedure CommunityExplore({μ1, . . . , μm}, K, non-adaptive)
2:   For i ∈ [m], ki ← 0                          ▷ Lines 2–5: budget allocation
3:   for s = 1, . . . , K do
4:     i* ← a random element in arg max_i (1 − μi)^{ki}   ▷ O(log m) using a priority queue
5:     ki* ← ki* + 1
6:   For i ∈ [m], explore Ci for ki times, and put the uniformly met members in multiset Si
7: end procedure

Algorithm 2 Adaptive community exploration with greedy policy
1: procedure CommunityExplore({μ1, . . . , μm}, K, adaptive)
2:   For i ∈ [m], Si ← ∅, ci ← 0      ▷ Lines 2–7: adaptively explore communities with policy πg
3:   for s = 1, . . . , K do
4:     i* ← a random element in arg max_i 1 − μi ci
5:     v ← a random member met when Ci* is explored
6:     if v ∉ Si* then ci* ← ci* + 1                ▷ v has not been met before
7:     Si* ← Si* ∪ {v}
8: end procedure

Let μi = 1/di, and let the vector μ = (1/d1, . . . , 1/dm). Henceforth, we treat μ as the parameter of the problem instance, since it is bounded with μ ∈ [0, 1]^m. Let r_k(μ) = E_Φ[R(k, Φ)] be the expected reward for the budget allocation k. Based on the above discussion, we have

r_k(μ) = Σ_{i=1}^m di(1 − (1 − 1/di)^{ki}) = Σ_{i=1}^m (1 − (1 − μi)^{ki})/μi.    (5)

Since the ki must be integers, a traditional method such as Lagrange multipliers cannot be applied to solve the optimization problem defined in Eq. (1). We propose a greedy method consisting of K steps to compute the optimal feasible k*. The greedy method is described in Lines 2–5 of Algo. 1.

Theorem 1.
The greedy method obtains an optimal budget allocation.

The time complexity of the greedy method is O(K log m), which is not efficient for large K. We find that, starting from the initial allocation ki = ⌈ ((K − m)/ln(1 − μi)) / (Σ_{j=1}^m 1/ln(1 − μj)) ⌉, the greedy method can find the optimal budget allocation in O(m log m) time.² (See the supplementary materials.)

3.2 Adaptive Exploration Algorithms

With a slight abuse of notation, we also define r_π(μ) = E_Φ [R(π_K(Φ), Φ)], since the expected reward is a function of the policy π and the vector μ. Define ci(ψ) as the number of distinct members we have met in community Ci under partial realization ψ. Then 1 − ci(ψ)/di is the probability that we meet a new member in community Ci if we explore it one more time. A natural approach is to explore the community Ci* with i* ∈ arg max_{i∈[m]} 1 − ci(ψ)/di when we have partial realization ψ. We call this policy the greedy policy πg. Adaptive community exploration with the greedy policy is described in Algo. 2. One can show that our reward function is an adaptive submodular function, for which the greedy policy is guaranteed to achieve at least (1 − 1/e) of the maximum expected reward [13]. However, the following theorem shows that for our community exploration problem, the greedy policy is in fact optimal.

Theorem 2. The greedy policy is the optimal policy for our adaptive exploration problem.

Proof sketch. Note that the greedy policy chooses the next community based only on the fraction of unseen members. It does not care which members have already been met. Thus, we define si as the percentage of members we have not yet met in community Ci, and introduce the concept of a status, denoted s = (s1, . . . , sm).
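A compact sketch of the two greedy procedures (Algorithms 1 and 2) might look as follows. This is an illustrative re-implementation under our own naming, not the authors' code, and the membership sampling is simulated.

```python
import heapq
import random

def greedy_allocation(mu, K):
    """Algorithm 1 (Lines 2-5): repeatedly grant one unit of budget to the
    community with the largest marginal gain (1 - mu_i)^{k_i}."""
    k = [0] * len(mu)
    heap = [(-1.0, i) for i in range(len(mu))]  # gain is 1 when k_i = 0; negate for max-heap
    heapq.heapify(heap)
    for _ in range(K):
        _, i = heapq.heappop(heap)
        k[i] += 1
        heapq.heappush(heap, (-((1 - mu[i]) ** k[i]), i))
    return k

def adaptive_greedy(sizes, K, rng=None):
    """Algorithm 2: always explore the community maximizing 1 - mu_i * c_i,
    the probability of meeting a new member; returns the distinct count."""
    rng = rng or random.Random(0)
    mu = [1 / d for d in sizes]
    seen = [set() for _ in sizes]                 # S_i, with c_i = |S_i|
    for _ in range(K):
        i = max(range(len(sizes)), key=lambda j: 1 - mu[j] * len(seen[j]))
        seen[i].add(rng.randrange(sizes[i]))      # meet a uniform member of C_i
    return sum(len(s) for s in seen)
```

The heap gives the O(K log m) running time noted above; ties are broken by index here rather than uniformly at random.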
The greedy policy chooses the next community based on the current status.

²We thank Jing Yu from the School of Mathematical Sciences at Fudan University for her method of finding a good initial allocation, which leads to a faster greedy method.

Algorithm 3 Combinatorial Lower Confidence Bound (CLCB) algorithm
Input: budget K for each round, method (non-adaptive or adaptive)
1: For i ∈ [m], Ti ← 0 (number of pairs), Xi ← 0 (collision count), μ̂i ← 0 (empirical mean)
2: for t = 1, 2, 3, . . . do                                  ▷ Lines 2–8: online learning
3:   For i ∈ [m], ρi ← √(3 ln t / (2Ti))  (ρi = 0 if Ti = 0)        ▷ confidence radius
4:   For i ∈ [m], μ̲i ← max{0, μ̂i − ρi}                      ▷ lower confidence bound
5:   {S1, . . . , Sm} ← CommunityExplore({μ̲1, . . . , μ̲m}, K, method)  ▷ Si: set of met members
6:   For i ∈ [m], Ti ← Ti + ⌊|Si|/2⌋            ▷ update the number of (member) pairs observed
7:   For i ∈ [m], Xi ← Xi + Σ_{x=1}^{⌊|Si|/2⌋} 1{Si[2x − 1] = Si[2x]}   ▷ Si[x]: x-th element of Si
8:   For i ∈ [m] and |Si| > 1, μ̂i ← Xi/Ti                    ▷ update empirical mean

In the proof, we further extend the definition of the reward with a non-decreasing function f as R(k, φ) = f( Σ_{i=1}^m |∪_{τ=1}^{ki} {φ(i, τ)}| ). Note that the reward function corresponding to the original community exploration problem is simply the identity function f(x) = x. Let Fπ(ψ, t) denote the expected marginal gain when we further explore communities for t steps with policy π starting from a partial realization ψ.
We want to prove that for all ψ, t and π, F_{πg}(ψ, t) ≥ Fπ(ψ, t), where πg is the greedy policy and π is an arbitrary policy. If so, we simply take ψ = ∅, and F_{πg}(∅, t) ≥ Fπ(∅, t) for every π and t exactly shows that πg is optimal. We prove the above result by induction on t.

Let Ci be the community chosen by π based on the partial realization ψ. Define c(ψ) = Σ_i ci(ψ) and Δ_{ψ,f} = f(c(ψ) + 1) − f(c(ψ)). We first claim that F_{πg}(ψ, 1) ≥ Fπ(ψ, 1) holds for all ψ and π, using the fact that Fπ(ψ, 1) = (1 − μi ci(ψ)) Δ_{ψ,f}. Note that the greedy policy πg chooses Ci* with i* ∈ arg max_i (1 − μi ci(ψ)). Hence, F_{πg}(ψ, 1) ≥ Fπ(ψ, 1).

Next we prove that F_{πg}(ψ, t+1) ≥ Fπ(ψ, t+1) under the induction hypothesis that F_{πg}(ψ, t′) ≥ Fπ(ψ, t′) holds for all ψ, π, and t′ ≤ t. An important observation is that F_{πg}(ψ, t) has the same value for any partial realization ψ associated with the same status s, since the status suffices for the greedy policy to determine the choice of the next community. Formally, we define Fg(s, t) = F_{πg}(ψ, t) for any partial realization ψ that satisfies s = (1 − c1(ψ)/d1, . . . , 1 − cm(ψ)/dm). Let Ci* denote the community chosen by policy πg under partial realization ψ, i.e., i* ∈ arg max_{i∈[m]} 1 − μi ci(ψ). Let I_i be the m-dimensional unit vector with one in the i-th entry and zeros in all other entries.
We show that

Fπ(ψ, t + 1) ≤ ci(ψ) · μi Fg(s, t) + (di − ci(ψ)) · μi Fg(s − μi I_i, t) + (1 − μi ci(ψ)) Δ_{ψ,f}
           ≤ μi* ci*(ψ) Fg(s, t) + (1 − μi* ci*(ψ)) Fg(s − μi* I_{i*}, t) + (1 − μi* ci*(ψ)) Δ_{ψ,f}
           = Fg(s, t + 1) = F_{πg}(ψ, t + 1).

The first line is derived directly from the definition and the induction hypothesis. The key is to prove the correctness of the second line in the above inequality. It indicates that if we choose a sub-optimal community first and then switch back to the greedy policy, the expected reward becomes smaller. The proof is nontrivial and relies on a careful analysis of the stochastic transitions among status vectors. We leave the detailed analysis to the supplementary materials. Note that the reward function r_π(μ) is not necessarily adaptive submodular if we extend the reward with a non-decreasing function f. Hence, the (1 − 1/e) guarantee for adaptive submodular functions [13] is not applicable in this scenario. Our analysis scheme can be applied to other adaptive problems with similar structures.

4 Online Learning for Community Exploration

The key to the learning algorithm is estimating the community sizes. The size estimation problem is defined as inferring the unknown set size di from random samples obtained by uniform sampling with replacement from the set Ci. Various estimators have been proposed [3, 8, 10, 16] for the estimation of di. The core idea of the estimators in [3, 16] is “collision counting”. Let (u, v) be an unordered pair of two random elements from Ci, and let Y_{u,v} be the pair collision random variable that takes value 1 if u = v (i.e., (u, v) is a collision) and 0 otherwise. It is easy to verify that E[Y_{u,v}] = 1/di = μi.
Suppose we independently take Ti pairs of elements from Ci and Xi of them are collisions. Then E[Xi/Ti] = 1/di = μi. The size di can be estimated by Ti/Xi (the estimator is valid when Xi > 0).

We present our CLCB algorithm in Algorithm 3. In the algorithm, we maintain an unbiased estimate of μi instead of di for each community Ci, for the following reasons. Firstly, Ti/Xi is not an unbiased estimator of di, since E[Ti/Xi] ≥ di by Jensen's inequality. Secondly, the upper confidence bound of Ti/Xi depends on di, which is unknown in our online learning problem. Thirdly, we need at least (1 + √(8di ln(1/δ) + 1))/2 uniformly sampled elements of Ci to make sure that Xi > 0 with probability at least 1 − δ. We feed the lower confidence bound μ̲i to the exploration process, since our reward function increases as μi decreases. The idea is similar to the CUCB algorithm [7]. The lower confidence bound μ̲i is small if community Ci has not been explored often (Ti is small), and a small μ̲i motivates us to explore Ci more times. The feedback after the exploration process in each round consists of the sets of encountered members S1, . . . , Sm in communities C1, . . . , Cm, respectively. Note that for each i ∈ [m], the pairs of elements in Si, namely {(x, y) | x ≤ y, x ∈ Si, y ∈ Si\{x}}, are not mutually independent. Thus, we only use ⌊|Si|/2⌋ independent pairs, and Ti is updated to Ti + ⌊|Si|/2⌋ in each round. In each round, the community exploration can be either non-adaptive or adaptive, and the following regret analysis discusses these two cases separately.

4.1 Regret Analysis for the Non-adaptive Version

The non-adaptive bandit learning model fits into the general combinatorial multi-armed bandit (CMAB) framework of [7, 20] that deals with nonlinear reward functions.
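One round of the estimator bookkeeping in Algorithm 3 (confidence radius, lower confidence bound, and the pair-collision update from the round's feedback) can be sketched as follows; the names are ours and the exploration step itself is abstracted away.

```python
import math

def lcb(mu_hat, T, t):
    """Lines 3-4 of Algorithm 3: confidence radius and lower confidence bound."""
    rho = math.sqrt(3 * math.log(t) / (2 * T)) if T > 0 else 0.0
    return max(0.0, mu_hat - rho)

def collision_update(S, X, T):
    """Lines 6-8: use only floor(|S_i|/2) independent pairs from the round's
    feedback S_i (the sequence of met members) to update the collision counts."""
    pairs = len(S) // 2
    X += sum(S[2 * x] == S[2 * x + 1] for x in range(pairs))  # 0-indexed pairing
    T += pairs
    mu_hat = X / T if T > 0 else 0.0
    return X, T, mu_hat
```

Feeding `lcb(mu_hat, T, t)` for each community into the exploration procedure reproduces the optimism-under-uncertainty step: rarely observed communities get a small lower bound and are explored more.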
In particular, we can treat the pair collision variable in each community Ci as a base arm; our expected reward in Eq. (5) is non-linear, and it satisfies the monotonicity and bounded smoothness properties (see Properties 1 and 2). However, directly applying the regret results from [7, 20] would give us an inferior regret bound, for two reasons. First, in our setting, each round may provide multiple sample feedbacks for each community, meaning that each base arm can be observed multiple times, which is not directly covered by CMAB. Second, to use the regret results in [7, 20], the bounded smoothness property needs a bounded smoothness constant independent of the actions, but we can obtain a better result by using a tighter form of bounded smoothness with action-related coefficients. Therefore, in this section, we provide a better regret result by adapting the regret analysis in [20].

We define the gap Δ_k = r_{k*}(μ) − r_k(μ) for every action k satisfying Σ_{i=1}^m ki = K. For each community Ci, we define Δ^i_min = min_{Δ_k>0, ki>1} Δ_k and Δ^i_max = max_{Δ_k>0, ki>1} Δ_k. As a convention, if there is no action k with ki > 1 such that Δ_k > 0, we define Δ^i_min = ∞ and Δ^i_max = 0. Furthermore, define Δ_min = min_{i∈[m]} Δ^i_min and Δ_max = max_{i∈[m]} Δ^i_max.
Let K′ = K − m + 1. We have the following regret bound for Algo. 3.

Theorem 3. Algo. 3 with the non-adaptive exploration method has regret

Reg_μ(T) ≤ Σ_{i=1}^m 48 (K′ choose 2) K ln T / Δ^i_min + (π²/3 · (K′ choose 2) ⌊K′/2⌋ + 2) m Δ_max = O( Σ_{i=1}^m K′³ log T / Δ^i_min ).    (6)

The proof of the above theorem is an adaptation of the proof of Theorem 4 in [20]; the full proof details, as well as a detailed comparison with the original CMAB framework result, are included in the supplementary materials. We briefly explain the adaptation that leads to the regret improvement. We rely on the following monotonicity and 1-norm bounded smoothness properties of our expected reward function r_k(μ), similar to the ones in [7, 20].

Property 1 (Monotonicity). The reward function r_k(μ) is monotonically decreasing, i.e., for any two vectors μ = (μ1, . . . , μm) and μ′ = (μ′1, . . . , μ′m), we have r_k(μ) ≥ r_k(μ′) if μi ≤ μ′i for all i ∈ [m].

Property 2 (1-Norm Bounded Smoothness). The reward function r_k(μ) satisfies the 1-norm bounded smoothness property, i.e., for any two vectors μ = (μ1, . . . , μm) and μ′ = (μ′1, . . . , μ′m), we have |r_k(μ) − r_k(μ′)| ≤ Σ_{i=1}^m (ki choose 2) |μi − μ′i| ≤ (K′ choose 2) Σ_{i=1}^m |μi − μ′i|.

We remark that if we directly applied the CMAB regret bound of Theorem 4 in [20], we would need to revise the update procedure in Lines 6–8 of Algo. 3 so that in each round we only update one observation for each community Ci with |Si| > 1. We would then obtain a regret bound of O( Σ_i K′⁴ m log T / Δ^i_min ), which means that our regret bound in Eq. (6) is an improvement by a factor of O(K′m).
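The two properties can be checked numerically on small instances. The sketch below evaluates r_k(μ) from Eq. (5) and verifies one instance of the 1-norm bounded smoothness inequality; the particular parameter values are arbitrary.

```python
from math import comb

def r(k, mu):
    """Expected reward r_k(mu) from Eq. (5)."""
    return sum((1 - (1 - m) ** ki) / m for ki, m in zip(k, mu))

k = [3, 2]
mu, mu2 = [0.2, 0.5], [0.25, 0.4]
lhs = abs(r(k, mu) - r(k, mu2))                               # |r_k(mu) - r_k(mu')|
rhs = sum(comb(ki, 2) * abs(a - b) for ki, a, b in zip(k, mu, mu2))
# Property 2 guarantees lhs <= rhs for any pair of parameter vectors.
```

Monotonicity (Property 1) can be spot-checked the same way by decreasing any single μi and confirming that r(k, μ) does not decrease.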
This improvement is exactly due to the reasons given earlier, as we now explain in more detail. For all the random variables introduced in Algo. 3, we add the subscript t to denote their value at the end of round t; for example, Ti,t is the value of Ti at the end of round t.

First, the factor m comes from the use of the tighter bounded smoothness in Property 2: we use the bound Σ_{i=1}^m (ki choose 2) |μi − μ′i| instead of (K′ choose 2) Σ_{i=1}^m |μi − μ′i|. The CMAB framework in [20] requires the bounded smoothness constant to be independent of the actions, so to apply Theorem 4 in [20] we would have to use the bound (K′ choose 2) Σ_{i=1}^m |μi − μ′i|. In our case, when using the bound Σ_{i=1}^m (ki choose 2) |μi − μ′i|, we are able to utilize the fact Σ_{i=1}^m (ki choose 2) ≤ (K′ choose 2) to improve the result by a factor of m.

Second, the improvement by the O(K′) factor, more precisely a factor of (K′ − 1)/2, is achieved by utilizing multiple feedbacks in a single round and a more careful analysis of the regret using the property of the right Riemann summation. Specifically, let Δ_{k_t} = r_{k*}(μ) − r_{k_t}(μ) be the reward gap in round t. When the estimates are within the confidence radius, we have Δ_{k_t} ≤ c Σ_{i=1}^m ⌊ki,t/2⌋ / √(Ti,t−1), where c is a constant. In Algo. 3, we have Ti,t = Ti,t−1 + ⌊ki,t/2⌋ because we allow multiple feedbacks in a single round. Then Σ_{t≥1, Ti,t≤Li(T)} ⌊ki,t/2⌋ / √(Ti,t−1) is in the form of a right Riemann summation, which achieves its maximum value when ki,t = K′; here Li(T) is a ln T function with some constants related to community Ci. Hence the regret is bounded as Σ_{t=1}^T Δ_{k_t} ≤ c Σ_{i=1}^m Σ_{t≥1, Ti,t≤Li(T)} ⌊ki,t/2⌋ / √(Ti,t−1) ≤ 2c Σ_{i=1}^m √(Li(T)). However, if we used the original CMAB framework, we would need to set Ti,t = Ti,t−1 + 1{ki,t > 1}. In this case, we could only bound the regret as Σ_{t=1}^T Δ_{k_t} ≤ c Σ_{i=1}^m Σ_{t≥1, Ti,t≤Li(T)} (ki,t − 1)/(2√(Ti,t−1)) ≤ 2c (K′ − 1)/2 Σ_{i=1}^m √(Li(T)), leading to an extra factor of (K′ − 1)/2.

Justification for Algo. 3. In Algo. 3, we only use the members met in the current round to update the estimator. This is practical for situations where the member identifiers change across rounds for privacy protection. Privacy has gained much attention these days. Consider the online advertising scenario explained in the introduction. Whenever a user clicks an advertisement, the advertiser stores the user's information (e.g., Facebook ID, IP address, etc.) to identify the user and correlate the click with past visits of the user. If such user identifiers were fixed and never changed, the advertiser could easily track user behavior, which may result in privacy leaks. A reasonable protection for users is to periodically change user IDs (e.g., Facebook can periodically change user hash IDs, or users can adopt dynamic IP addresses, etc.), so that it is difficult for the advertiser to track the same user over a long period of time.
Under such a situation, our learning algorithm can likely still detect ID collisions within the short period of each learning round, but across different rounds, collisions may not be detectable due to the ID changes.

Full information feedback. Now we consider the scenario where the member identifiers are fixed over all rounds, and design an algorithm with a constant regret bound. Our idea is to ensure that we observe at least one pair of members in every community C_i in each round t; we call such a guarantee full information feedback. If we only use the members revealed in the current round, we cannot achieve this goal, since we have no observation of new pairs for a community C_i when k_i = 1. To achieve full information feedback, we use at least one sample from the previous round to form a pair with a sample from the current round, generating a valid pair-collision observation. In particular, we revise Lines 3, 6, and 7 as follows. Here we use u_0 to represent the last member of S_i in the previous round (let u_0 = null when t = 1) and u_x (x > 0) to represent the x-th member of S_i in the current round. The revision of Line 3 implies that we use the empirical mean μ̂_i = X_i/T_i instead of the lower confidence bound in the function CommunityExplore.

Line 3: For i ∈ [m], ρ_i = 0;  Line 6: For i ∈ [m], T_i ← T_i + |S_i| − 1{t = 1};
Line 7: For i ∈ [m], X_i ← X_i + ∑_{x=0}^{|S_i|−1} 1{u_x = u_{x+1}}.   (7)

Theorem 4. With the full information feedback revision in Eq. (7), Algo. 3 with the non-adaptive exploration method has a constant regret bound. Specifically,

Reg_μ(T) ≤ (2 + 2m e² K′²(K′−1)²/Δ²_min) Δ_max.

Note that we cannot apply the Hoeffding bound in [14] directly, since the random variables 1{u_x = u_{x+1}} we obtain during the online learning process are not mutually independent.
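To see concretely how the revised update in Eq. (7) behaves, and why the pair indicators are only locally dependent, consider the following toy simulation (our own sketch; the function name, the community size d, and all parameter values are made up). Consecutive indicators 1{u_x = u_{x+1}} and 1{u_{x+1} = u_{x+2}} share the member u_{x+1} and so are dependent, but the empirical mean X_i/T_i still concentrates around μ_i = 1/d_i, the probability that two independently drawn members collide.

```python
import random

def simulate(d=50, k=3, rounds=20_000, seed=0):
    """Simulate the Eq. (7) update for one community of size d, drawing k
    uniformly random members per round. u0 is the last member kept from the
    previous round, so every round contributes at least one consecutive pair
    even when k = 1."""
    rng = random.Random(seed)
    T, X = 0, 0          # pair count and collision count
    u0 = None            # last member of the previous round (null in round 1)
    for t in range(rounds):
        S = [rng.randrange(d) for _ in range(k)]    # members met this round
        seq = ([u0] if u0 is not None else []) + S
        T += len(seq) - 1                           # |S_i| pairs (|S_i|-1 in round 1)
        X += sum(seq[x] == seq[x + 1] for x in range(len(seq) - 1))
        u0 = S[-1]
    mu_hat = X / T       # estimates mu_i = 1/d_i
    return mu_hat, round(1 / mu_hat)

mu_hat, d_hat = simulate()
print(mu_hat, d_hat)     # mu_hat should be close to 1/50 = 0.02
```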
Instead, we apply a concentration bound from [9] that is applicable to variables with a local dependence structure.

4.2 Regret Analysis for the Adaptive Version

For the adaptive version, we feed the lower confidence bounds μ̲_t into the adaptive community exploration procedure, namely CommunityExplore({μ̲_{1,t}, ..., μ̲_{m,t}}, K, adaptive) in round t. We denote the policy implemented by this procedure as π_t. Note that both π_g and π_t are based on the greedy procedure CommunityExplore(·, K, adaptive); the difference is that π_g uses the true parameter μ while π_t uses the lower bound parameter μ̲_t. More specifically, given a partial realization ψ, the community chosen by π_t is C_{i*} where i* ∈ arg max_{i∈[m]} 1 − c_i(ψ)μ̲_{i,t}. Recall that c_i(ψ) is the number of distinct encountered members of community C_i under the partial realization ψ.

We first properly define the metrics Δ^{i,k}_min and Δ^{(k)}_max used in the regret bound, as follows. Consider a specific full realization φ where {φ(i,1), ..., φ(i,d_i)} are the d_i distinct members of C_i for i ∈ [m]. The realization φ indicates that we will obtain a new member in each of the first d_i explorations of community C_i. Let U_{i,k} denote the number of times community C_i is selected by policy π_g in the first k − 1 (k > m) steps under the special full realization φ defined previously. We define Δ^{i,k}_min = (μ_i U_{i,k} − min_{j∈[m]} μ_j U_{j,k})/U_{i,k}. Conceptually, the value μ_i U_{i,k} − min_{j∈[m]} μ_j U_{j,k} is the gap in the expected reward of the next step between selecting a community by π_g (the optimal policy) and selecting community C_i, when we have already met U_{j,k} distinct members of C_j for j ∈ [m]. When μ_i U_{i,k} = min_{j∈[m]} μ_j U_{j,k}, we define Δ^{i,k}_min = ∞. Let π be another policy that chooses the same sequence of communities as π_g as long as the number of met members of C_i is no more than U_{i,k} for all i ∈ [m]. Note that policy π chooses the same communities as π_g in the first k − 1 steps under the special full realization φ; in fact, the policy π is the same as π_g for at least k − 1 steps. We use Π_k to denote the set of all such policies, and define Δ^{(k)}_max as the maximum reward gap between a policy π ∈ Π_k and the optimal policy π_g, i.e., Δ^{(k)}_max = max_{π∈Π_k} r_{π_g}(μ) − r_π(μ). Let D = ∑_{i=1}^m d_i.

Theorem 5. Algo. 3 with the adaptive exploration method has the following regret bound:

Reg_μ(T) ≤ (∑_{i=1}^m ∑_{k=m+1}^{min{K,D}} 6Δ^{(k)}_max/(Δ^{i,k}_min)²) ln T + (⌊K′/2⌋π²/3) ∑_{i=1}^m ∑_{k=m+1}^{min{K,D}} Δ^{(k)}_max.   (8)

Theorem 6. With the full information feedback revision in Eq. (7), Algo. 3 with the adaptive exploration method has a constant regret bound. Specifically,

Reg_μ(T) ≤ ∑_{i=1}^m ∑_{k=m+1}^{min{K,D}} (2/ε⁴_{i,k} + 1) Δ^{(k)}_max,

where ε_{i,k} is defined as (here i*_k ∈ arg min_{i∈[m]} μ_i U_{i,k})

ε_{i,k} ≜ (μ_i U_{i,k} − μ_{i*_k} U_{i*_k,k})/(U_{i,k} + U_{i*_k,k}) for i ≠ i*_k, and ε_{i,k} = ∞ for i = i*_k.

Gabillon et al. [11] analyze a general adaptive submodular function maximization problem in the bandit setting. We would obtain a regret bound of a form similar to (8) if we directly applied Theorem 1 in [11]. However, their version of Δ^{(k)}_max is an upper bound on the expected reward of policy π_g from step k onward, which is larger than our Δ^{(k)}_max; and their version of Δ^{i,k}_min is the minimum of (μ_i c_i(ψ) − min_{j∈[m]} μ_j c_j(ψ))/c_i(ψ) over all partial realizations ψ obtained after policy π_g is executed for k steps, which is smaller than our Δ^{i,k}_min. Our regret analysis is based on counting how many times π_g and π_t choose different communities under the special full realization φ, while the analysis in [11] counts how many times π_g and π_t choose different communities under all possible full realizations.

Discussion. In this paper, we consider the online learning problem that consists of T rounds, during each of which we explore the communities with a budget K. Our goal is to maximize the cumulative reward over the T rounds. Another important and natural setting is the following: we start to explore communities with unknown sizes, and update the parameters every time we explore a community for one step (or for a few steps). Different from the setting defined in this paper, here a member does not contribute to the reward if it has been met in previous rounds. To differentiate the two settings, let us call the latter "interactive community exploration" and the former "repeated community exploration". Both the repeated community exploration defined in this paper and the interactive community exploration we will study in future work have corresponding applications. The former is suitable for online advertising where in each round the advertiser promotes different products; hence the rewards in different rounds are additive.
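As a concrete illustration of the greedy rule shared by π_g and π_t in Section 4.2, here is a toy simulation (our own sketch, not the authors' implementation; the two-community instance and all names are made up). At each step it selects the community maximizing the expected marginal gain 1 − c_i(ψ)μ_i of meeting a new member, then draws a uniformly random member from that community.

```python
import random

def community_explore_adaptive(mu, K, sizes, seed=0):
    """Greedy adaptive exploration: at each of K steps, pick the community
    maximizing 1 - c_i * mu_i, where c_i is the number of distinct members
    met so far in C_i and mu_i = 1/d_i. Returns the total number of distinct
    members met (the reward)."""
    rng = random.Random(seed)
    met = [set() for _ in mu]        # distinct members met per community
    for _ in range(K):
        # greedy choice: maximize expected marginal gain 1 - c_i(psi) * mu_i
        i = max(range(len(mu)), key=lambda j: 1 - len(met[j]) * mu[j])
        met[i].add(rng.randrange(sizes[i]))   # meet a uniform random member
    return sum(len(s) for s in met)

sizes = [5, 20]                  # d_1 = 5, d_2 = 20 (toy instance)
mu = [1 / d for d in sizes]      # true parameters mu_i = 1/d_i
print(community_explore_adaptive(mu, K=12, sizes=sizes))
```

Feeding the true μ corresponds to π_g; feeding the lower confidence bounds μ̲_t instead yields the online policy π_t.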
The latter setting corresponds to adaptive online advertising for the same product, where the rewards in different rounds are dependent.

5 Related Work

Golovin and Krause [13] show that a greedy policy achieves at least a (1 − 1/e) approximation for adaptive submodular function maximization. This result could be applied to our offline adaptive problem, but by an independent analysis we show the stronger result that the greedy policy is optimal. The multi-armed bandit (MAB) problem was initiated by Robbins [18] and extensively studied in [2, 4, 19]. Our online learning algorithm is based on the extensively studied Upper Confidence Bound approach [1]. The non-adaptive community exploration problem in the online setting fits into the general combinatorial multi-armed bandit (CMAB) framework [6, 7, 12, 17, 20], where the reward is a set function of base arms. The CMAB problem was first studied in [12], and its regret bound was improved by [7, 17]. We leverage the analysis framework of [7, 20] and prove a tighter bound for our algorithm. Gabillon et al. [11] define an adaptive submodular maximization problem in the bandit setting; our online adaptive exploration problem is an instance of the problem defined in [11], and we prove a tighter bound than the one in [11] by using the properties of our problem.

Our model bears similarities to the optimal discovery problem proposed in [5]: both assume disjoint communities, and both try to maximize the number of target elements. However, there are also some differences: (a) we use different estimators for our critical parameters, because our problem setting is different; (b) their online model is closer to the interactive community exploration explained in Section 4.2, while our online model is repeated community exploration; as explained in Section 4.2, the two online models serve different applications and have different algorithms and analyses.
(c) We also have more comprehensive studies of the offline cases.

6 Future Work

In this paper, we systematically study the community exploration problem. In the offline setting, we propose greedy methods for both the non-adaptive and adaptive exploration problems, and rigorously prove their optimality. We also analyze the online setting where the community sizes are initially unknown, and provide a CLCB algorithm for online community exploration that achieves an O(log T) regret bound. If we further allow full information feedback, the CLCB algorithm with some minor revisions achieves a constant regret bound.

Our study opens up a number of possible future directions. For example, we can consider various extensions of the problem model, such as more complicated distributions of member meeting probabilities, overlapping communities, or even graph structures between communities. We could also study the gap between non-adaptive and adaptive solutions.

Acknowledgments

We thank Jing Yu from the School of Mathematical Sciences at Fudan University for her insightful discussion of the offline problems; in particular, we thank her for her method of finding a good initial allocation, which leads to a faster greedy method. Wei Chen is partially supported by the National Natural Science Foundation of China (Grant No. 61433014). The work of John C.S. Lui is supported in part by the GRF Grant 14208816.

References

[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] Donald A. Berry and Bert Fristedt. Bandit problems: sequential allocation of experiments. Chapman and Hall, 5:71–87, 1985.

[3] Marco Bressan, Enoch Peserico, and Luca Pretto. Simple set cardinality estimation through random sampling. arXiv preprint arXiv:1512.07901, 2015.

[4] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al.
Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.

[5] Sébastien Bubeck, Damien Ernst, and Aurélien Garivier. Optimal discovery with probabilistic expert advice: finite time analysis and macroscopic optimality. JMLR, 14(Feb):601–623, 2013.

[6] Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions. In NIPS, pages 1659–1667, 2016.

[7] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33, 2016. A preliminary version appeared as Chen, Wang, and Yuan, "Combinatorial multi-armed bandit: General framework, results and applications", ICML'2013.

[8] Mary C. Christman and Tapan K. Nayak. Sequential unbiased estimation of the number of classes in a population. Statistica Sinica, pages 335–352, 1994.

[9] Devdatt Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 1st edition, 2009.

[10] Mark Finkelstein, Howard G. Tucker, and Jerry Alan Veeh. Confidence intervals for the number of unseen types. Statistics & Probability Letters, pages 423–430, 1998.

[11] Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, and S. Muthukrishnan. Adaptive submodular maximization in bandit setting. In NIPS, pages 2697–2705, 2013.

[12] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Trans. Netw., 20(5):1466–1478, 2012.

[13] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization.
Journal of Artificial Intelligence Research, 42:427–486, 2011.

[14] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[15] Svante Janson. Large deviations for sums of partly dependent random variables. Random Structures & Algorithms, 24(3):234–248, 2004.

[16] Liran Katzir, Edo Liberty, and Oren Somekh. Estimating sizes of social networks via biased sampling. In WWW, 2011.

[17] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.

[18] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.

[19] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[20] Qinshi Wang and Wei Chen. Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications. In NIPS, pages 1161–1171, 2017.