{"title": "Improving Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms and Its Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 1161, "page_last": 1171, "abstract": "We study combinatorial multi-armed bandit with probabilistically triggered arms (CMAB-T) and semi-bandit feedback. We resolve a serious issue in the prior CMAB-T studies where the regret bounds contain a possibly exponentially large factor of 1/p*, where p* is the minimum positive probability that an arm is triggered by any action. We address this issue by introducing a triggering probability modulated (TPM) bounded smoothness condition into the general CMAB-T framework, and show that many applications such as the influence maximization bandit and combinatorial cascading bandits satisfy this TPM condition. As a result, we completely remove the factor of 1/p* from the regret bounds, achieving significantly better regret bounds for influence maximization and cascading bandits than before. Finally, we provide lower bound results showing that the factor 1/p* is unavoidable for general CMAB-T problems, suggesting that the TPM condition is crucial in removing this factor.", "full_text": "Improving Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms and Its Applications\n\nQinshi Wang\nPrinceton University\nPrinceton, NJ 08544\nqinshiw@princeton.edu\n\nWei Chen\nMicrosoft Research\nBeijing, China\nweic@microsoft.com\n\nAbstract\n\nWe study combinatorial multi-armed bandit with probabilistically triggered arms and semi-bandit feedback (CMAB-T). We resolve a serious issue in the prior CMAB-T studies where the regret bounds contain a possibly exponentially large factor of 1/p*, where p* is the minimum positive probability that an arm is triggered by any action.
We address this issue by introducing a triggering probability modulated (TPM) bounded smoothness condition into the general CMAB-T framework, and show that many applications such as the influence maximization bandit and combinatorial cascading bandits satisfy this TPM condition. As a result, we completely remove the factor of 1/p* from the regret bounds, achieving significantly better regret bounds for influence maximization and cascading bandits than before. Finally, we provide lower bound results showing that the factor 1/p* is unavoidable for general CMAB-T problems, suggesting that the TPM condition is crucial in removing this factor.\n\n
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n
1 Introduction\n\n
Stochastic multi-armed bandit (MAB) is a classical online learning framework modeled as a game between a player and an environment with m arms. In each round, the player selects one arm and the environment generates a reward of the arm from a distribution unknown to the player. The player observes the reward and uses it as feedback to the player's algorithm (or policy) to select arms in future rounds. The goal of the player is to accumulate as much reward as possible over time. MAB models the classical dilemma between exploration and exploitation: whether the player should keep exploring arms in search of a better arm, or should stick to the best arm observed so far to collect rewards. The standard performance measure of the player's algorithm is the (expected) regret, which is the difference in expected cumulative reward between always playing the best arm in expectation and playing according to the player's algorithm.\n\n
In recent years, stochastic combinatorial multi-armed bandit (CMAB) has received much attention (e.g. [9, 7, 6, 10, 13, 15, 14, 16, 8]), because it has wide applications in wireless networking, online advertising and recommendation, viral marketing in social networks, etc. In the typical setting of CMAB, the player selects a combinatorial action to play in each round, which triggers the play of a set of arms, and the outcomes of these triggered arms are observed as the feedback (called semi-bandit feedback). Besides the exploration-exploitation tradeoff, CMAB also needs to deal with the exponential explosion of the possible actions, which makes exploring all actions infeasible.\n\n
One class of the above CMAB problems involves probabilistically triggered arms [7, 14, 16], in which actions may trigger arms probabilistically; we denote it as CMAB-T in this paper. Chen et al. [7] provide such a general model and apply it to the influence maximization bandit, which models stochastic influence diffusion in social networks and sequentially selecting seed sets to maximize the cumulative influence spread over time. Kveton et al. [14, 16] study cascading bandits, in which arms are probabilistically triggered following a sequential order selected by the player as the action. However, in both studies, the regret bounds contain an undesirable factor of 1/p*, where p* is the minimum positive probability that any arm can be triggered by any action,1 and this factor could be exponentially large for both influence maximization and cascading bandits.\n\n
In this paper, we adapt the general CMAB framework of [7] in a systematic way to completely remove the factor of 1/p* for a large class of CMAB-T problems, including both influence maximization and combinatorial cascading bandits. The key observation is that for these problems, a harder-to-trigger arm has less impact on the expected reward, and thus we do not need to observe it as often. We turn this key observation into a triggering probability modulated (TPM) bounded smoothness condition, adapted from the original bounded smoothness condition in [7].
We eliminate the 1/p* factor in the regret bounds for all CMAB-T problems satisfying the TPM condition, and show that the influence maximization bandit and the conjunctive/disjunctive cascading bandits all satisfy the TPM condition. Moreover, for general CMAB-T without the TPM condition, we show a lower bound result that 1/p* is unavoidable, because the hard-to-trigger arms are crucial in determining the best arm and have to be observed enough times.\n\n
Besides removing the exponential factor, our analysis is also tighter in other regret factors and constants compared to the existing influence maximization bandit results [7, 25], the combinatorial cascading bandit results [16], and the results for linear bandits without probabilistically triggered arms [15]. Both the regret analysis based on the TPM condition and the proof that the influence maximization bandit satisfies the TPM condition are technically involved and nontrivial, but due to the space constraint, we have to move the complete proofs to the supplementary material. Instead, we introduce the key techniques in the main text.\n\n
Related Work. The multi-armed bandit problem was originally formulated by Robbins [20], and has been extensively studied in the literature [cf. 3, 21, 4]. Our study belongs to stochastic bandit research, while there is another line of research on adversarial bandits [2], for which we refer to a survey like [4] for further information. For stochastic MABs, an important approach is the Upper Confidence Bound (UCB) approach [1], on which most CMAB studies are based.\n\n
As already mentioned in the introduction, stochastic CMAB has received much attention in recent years. Among the studies, we improve (a) the general framework with probabilistically triggered arms of [7], (b) the influence maximization bandit results in [7] and [25], (c) the combinatorial cascading bandit results in [16], and (d) the linear bandit results in [15].
We defer the technical comparison with these studies to Section 4.3. Other CMAB studies do not deal with probabilistically triggered arms. Among them, [9] is the first study on linear stochastic bandits, but its regret bound has since been improved by Chen et al. [7] and Kveton et al. [15]. Combes et al. [8] improve the regret bound of [15] for linear bandits in a special case where arms are mutually independent. Most studies above are based on the UCB-style CUCB algorithm or its minor variants, and differ in their assumptions and regret analysis. Gopalan et al. [10] study Thompson sampling for complex actions, which is based on the Thompson sampling approach [22] and can be applied to CMAB, but their regret bound has a large exponential constant term.\n\n
Influence maximization is first formulated as a discrete optimization problem by Kempe et al. [12], and has been extensively studied since (cf. [5]). Variants of the influence maximization bandit have also been studied [18, 23, 24]. Lei et al. [18] use a different objective of maximizing the expected size of the union of the influenced nodes over time. Vaswani et al. [23] discuss how to transfer node-level feedback to edge-level feedback, and then apply the result of [7]. Vaswani et al. [24] replace the original maximization objective of influence spread with a heuristic surrogate function, avoiding the issue of probabilistically triggered arms. But their regret is defined against a weaker benchmark relaxed by the approximation ratio of the surrogate function, and thus their theoretical result is weaker than ours.\n\n
1The factor of 1/f* used for the combinatorial disjunctive cascading bandits in [16] is essentially 1/p*.\n\n
2 General Framework\n\n
In this section we present the general framework of combinatorial multi-armed bandit with probabilistically triggered arms, originally proposed in [7], with a slight adaptation, and denote it as CMAB-T. We illustrate that the influence maximization bandit [7] and combinatorial cascading bandits [14, 16] are example instances of CMAB-T.\n\n
CMAB-T is described as a learning game between a learning agent (or player) and the environment. The environment consists of m random variables X_1, . . . , X_m called base arms (or arms) following a joint distribution D over [0, 1]^m. Distribution D is picked by the environment from a class of distributions D before the game starts. The player knows the class D but not the actual distribution D.\n\n
The learning process proceeds in discrete rounds. In round t ≥ 1, the player selects an action S_t from an action space S based on the feedback history from the previous rounds, and the environment draws from the joint distribution D an independent sample X^(t) = (X^(t)_1, . . . , X^(t)_m). When action S_t is played on the environment outcome X^(t), a random subset of arms τ_t ⊆ [m] is triggered, and the outcomes X^(t)_i for all i ∈ τ_t are observed as the feedback to the player. The player also obtains a nonnegative reward R(S_t, X^(t), τ_t) fully determined by S_t, X^(t), and τ_t. A learning algorithm aims at properly selecting actions S_t over time based on the past feedback to accumulate as much reward as possible. Different from [7], we allow the action space S to be infinite. In the supplementary material, we discuss an example of continuous influence maximization [26] that uses a continuous and infinite action space while the number of base arms is still finite.\n\n
We now describe the triggered set τ_t in more detail, which is not explicit in [7]. In general, τ_t may have additional randomness beyond the randomness of X^(t).
Let D_trig(S, X) denote a distribution of the triggered subset of [m] for a given action S and an environment outcome X. We assume that τ_t is drawn independently from D_trig(S_t, X^(t)). We refer to D_trig as the probabilistic triggering function.\n\n
To summarize, a CMAB-T problem instance is a tuple ([m], S, D, D_trig, R), with elements already described above. These elements are known to the player, and hence constitute the problem input to the player. In contrast, the environment instance is the actual distribution D ∈ D picked by the environment, and is unknown to the player. The problem instance and the environment instance together form the (learning) game instance, in which the learning process unfolds. In this paper, we fix the environment instance D, unless we need to refer to more than one environment instance.\n\n
For each arm i, let µ_i = E_{X∼D}[X_i]. Let the vector µ = (µ_1, . . . , µ_m) denote the expectation vector of the arms. Note that the vector µ is determined by D. Same as in [7], we assume that the expected reward E[R(S, X, τ)], where the expectation is taken over X ∼ D and τ ∼ D_trig(S, X), is a function of the action S and the expectation vector µ of the arms. Henceforth, we denote r_S(µ) := E[R(S, X, τ)].\n\n
We remark that Chen et al. [6] relax the above assumption and consider the case where the entire distribution D, not just the mean of D, is needed to determine the expected reward. However, they need to assume that arm outcomes are mutually independent, and they do not consider probabilistically triggered arms. It might be interesting to incorporate probabilistically triggered arms into their setting, but this is out of the scope of the current paper.
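As a concrete rendering of the protocol just described (environment outcome, triggered set, semi-bandit feedback), the round structure can be sketched in a few lines of Python. The instance below, including the names `play_round` and `trigger`, is our own toy illustration and not from the paper:

```python
import random

def play_round(mu, action, trigger, reward):
    """One round of the CMAB-T protocol (illustrative sketch).

    mu      : true means of the m Bernoulli base arms (unknown to the player).
    action  : the action S_t chosen by the player.
    trigger : maps (S, X) to the triggered subset tau (may itself be random).
    reward  : maps (S, X, tau) to a nonnegative reward.
    Returns the reward and the semi-bandit feedback {i: X_i for i in tau}.
    """
    X = [1 if random.random() < m_i else 0 for m_i in mu]  # environment outcome X^(t)
    tau = trigger(action, X)                               # triggered arm set tau_t
    feedback = {i: X[i] for i in tau}                      # only triggered arms are observed
    return reward(action, X, tau), feedback

# Toy instance: playing S reveals the arms in S plus, when X_i = 1, arm i + 1.
def trigger(S, X):
    tau = set(S)
    for i in list(S):
        if X[i] == 1 and i + 1 < len(X):
            tau.add(i + 1)
    return tau

reward = lambda S, X, tau: sum(X[i] for i in tau)

random.seed(0)
r, fb = play_round([0.6, 0.3, 0.8], [0], trigger, reward)
```

Note how the semi-bandit feedback exposes outcomes only for the triggered set τ, which is exactly what makes hard-to-trigger arms hard to learn.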
To allow the algorithm to estimate µ_i directly from samples, we assume the outcome of an arm does not depend on whether the arm itself is triggered, i.e. E_{X∼D, τ∼D_trig(S,X)}[X_i | i ∈ τ] = E_{X∼D}[X_i].\n\n
The performance of a learning algorithm A is measured by its (expected) regret, which is the difference in expected cumulative reward between always playing the best action and playing actions selected by algorithm A. Formally, let opt_µ = sup_{S∈S} r_S(µ), where µ = E_{X∼D}[X], and we assume that opt_µ is finite. Same as in [7], we assume that the learning algorithm has access to an offline (α, β)-approximation oracle O, which takes µ = (µ_1, . . . , µ_m) as input and outputs an action S^O such that Pr{r_{S^O}(µ) ≥ α · opt_µ} ≥ β, where α is the approximation ratio and β is the success probability. Under the (α, β)-approximation oracle, the benchmark cumulative reward should be the αβ fraction of the optimal reward, and thus we use the following (α, β)-approximation regret:\n\n
Definition 1 ((α, β)-approximation Regret). The T-round (α, β)-approximation regret of a learning algorithm A (using an (α, β)-approximation oracle) for a CMAB-T game instance ([m], S, D, D_trig, R, D) with µ = E_{X∼D}[X] is\n\n
Reg^A_{µ,α,β}(T) = T · α · β · opt_µ − E[ Σ_{t=1}^T R(S^A_t, X^(t), τ_t) ] = T · α · β · opt_µ − E[ Σ_{t=1}^T r_{S^A_t}(µ) ],\n\n
where S^A_t is the action A selects in round t, and the expectation is taken over the randomness of the environment outcomes X^(1), . . . , X^(T), the triggered sets τ_1, . . . , τ_T, as well as the possible randomness of algorithm A itself.\n\n
We remark that because probabilistically triggered arms may strongly impact the determination of the best action, but may be hard to trigger and observe, the regret could be worse and the regret analysis is in general harder than for CMAB without probabilistically triggered arms.\n\n
The above framework essentially follows [7], but we decouple actions from subsets of arms, allow the action space to be infinite, and explicitly model the triggered set distribution, which makes the framework more powerful in modeling certain applications (see the supplementary material for more discussion).\n\n
2.1 Examples of CMAB-T: Influence Maximization and Cascading Bandits\n\n
In social influence maximization [12], we are given a weighted directed graph G = (V, E, p), where V and E are the sets of vertices and edges respectively, and each edge (u, v) is associated with a probability p(u, v). Starting from a seed set S ⊆ V, influence propagates in G as follows: nodes in S are activated at time 0, and at time t ≥ 1, a node u activated at time t − 1 has one chance to activate its inactive out-neighbor v with an independent probability p(u, v). The influence spread of seed set S, σ(S), is the expected number of activated nodes after the propagation ends. The offline problem of influence maximization is to find at most k seed nodes in G such that the influence spread is maximized. Kempe et al.
[12] provide a greedy algorithm with approximation ratio 1 − 1/e − ε and success probability 1 − 1/|V|, for any ε > 0.\n\n
For the online influence maximization bandit [7], the edge probabilities p(u, v) are unknown and need to be learned over time through repeated influence maximization tasks: in each round t, k seed nodes S_t are selected, the influence propagation from S_t is observed, the reward is the number of nodes activated in this round, and one wants to repeat this process to accumulate as much reward as possible. Putting it into the CMAB-T framework, the set of edges E is the set of arms [m], and their outcome distribution D is the joint distribution of m independent Bernoulli distributions with means p(u, v) for all (u, v) ∈ E. Any seed set S ⊆ V with at most k nodes is an action. The triggered arm set τ_t is the set of edges (u, v) reached by the propagation, that is, edges whose tail u can be reached from S_t by passing through only edges e ∈ E with X^(t)_e = 1. In this case, the distribution D_trig(S_t, X^(t)) degenerates to a deterministic triggered set. The reward R(S_t, X^(t), τ_t) equals the number of nodes in V that are reached from S_t through only edges e ∈ E with X^(t)_e = 1, and the expected reward is exactly the influence spread σ(S_t). The offline oracle is a (1 − 1/e − ε, 1 − 1/|V|)-approximation greedy algorithm. We remark that the general triggered set distribution D_trig(S_t, X^(t)) (together with an infinite action space) can be used to model extended versions of influence maximization, such as randomly selected seed sets in general marketing actions [12] and continuous influence maximization [26] (see the supplementary material).\n\n
Now let us consider combinatorial cascading bandits [14, 16]. In this case, we have m independent Bernoulli random variables X_1, . . . , X_m as base arms.
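Before turning to the details of cascading bandits, the influence maximization mapping above can be sketched as a simple live-edge simulation. The function below is our own hypothetical illustration (not the paper's code): the triggered arms are exactly the edges leaving activated nodes, and the round reward is the number of activated nodes.

```python
import random

def ic_round(nodes, edge_prob, seeds, rng):
    """One influence-maximization round in the CMAB-T view (illustrative).

    Each edge (u, v) is a base arm with mean edge_prob[(u, v)]; its outcome
    X_e is sampled up front (the 'live-edge graph').  The triggered arms are
    the edges whose tail is reached from the seed set through live edges,
    and the reward is the number of activated nodes.
    """
    X = {e: int(rng.random() < p) for e, p in edge_prob.items()}  # arm outcomes
    active, frontier = set(seeds), list(seeds)
    triggered = set()
    while frontier:
        u = frontier.pop()
        for (a, b), live in X.items():
            if a == u:
                triggered.add((a, b))          # edge out of an active node is observed
                if live and b not in active:   # a live edge activates its head
                    active.add(b)
                    frontier.append(b)
    return len(active), triggered              # reward = influence spread this round

rng = random.Random(1)
spread, obs = ic_round({0, 1, 2}, {(0, 1): 0.9, (1, 2): 0.9, (0, 2): 0.1}, [0], rng)
```

Edges far from the seed set are triggered only when a long chain of upstream edges is live, which is where the exponentially small triggering probabilities come from.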
An action is to select an ordered sequence from a subset of these arms satisfying certain constraints. Playing this action means that the player reveals the outcomes of the arms one by one following the sequence order until a certain stopping condition is satisfied. The feedback is the outcomes of the revealed arms and the reward is a function of these arms. In particular, in the disjunctive form the player stops when the first 1 is revealed, and she gains a reward of 1, or she reaches the end and gains reward 0. In the conjunctive form, the player stops when the first 0 is revealed (and receives reward 0) or she reaches the end with all 1 outcomes (and receives reward 1). Cascading bandits can be used to model online recommendation and advertising (in the disjunctive form with outcome 1 as a click) or network routing reliability (in the conjunctive form with outcome 0 as the routing edge being broken). It is straightforward to see that cascading bandits fit into the CMAB-T framework: the m variables are base arms, ordered sequences are actions, and the triggered set is the prefix set of arms until the stopping condition holds.\n\n
Algorithm 1 CUCB with computation oracle.\n
Input: m, Oracle\n
1: For each arm i, T_i ← 0 {maintain the total number of times arm i is played so far}\n
2: For each arm i, µ̂_i ← 1 {maintain the empirical mean of X_i}\n
3: for t = 1, 2, 3, . . . do\n
4:   For each arm i ∈ [m], ρ_i ← sqrt(3 ln t / (2 T_i)) {the confidence radius, ρ_i = +∞ if T_i = 0}\n
5:   For each arm i ∈ [m], µ̄_i = min{µ̂_i + ρ_i, 1} {the upper confidence bound}\n
6:   S ← Oracle(µ̄_1, . . . , µ̄_m)\n
7:   Play action S, which triggers a set τ ⊆ [m] of base arms with feedback X^(t)_i, i ∈ τ\n
8:   For every i ∈ τ, update T_i and µ̂_i: T_i = T_i + 1, µ̂_i = µ̂_i + (X^(t)_i − µ̂_i)/T_i\n
9: end for\n\n
3 Triggering Probability Modulated Condition\n\n
Chen et al. [7] use two conditions to guarantee the theoretical regret bounds. The first one is monotonicity, which we also use in this paper, and is restated below.\n\n
Condition 1 (Monotonicity). We say that a CMAB-T problem instance satisfies monotonicity if, for any action S ∈ S and any two distributions D, D' ∈ D with expectation vectors µ = (µ_1, . . . , µ_m) and µ' = (µ'_1, . . . , µ'_m), we have r_S(µ) ≤ r_S(µ') if µ_i ≤ µ'_i for all i ∈ [m].\n\n
The second condition is bounded smoothness. One key contribution of our paper is to properly strengthen the original bounded smoothness condition in [7] so that we can both get rid of the undesired 1/p* term in the regret bound and guarantee that many CMAB problems still satisfy the conditions. Our important change is to use triggering probabilities to modulate the condition, and thus we call such conditions triggering probability modulated (TPM) conditions. The key point of TPM conditions is including the triggering probability in the condition. We use p^{D,S}_i to denote the probability that action S triggers arm i when the environment instance is D. With this definition, we can also technically define p* as p* = inf_{i ∈ [m], S ∈ S, p^{D,S}_i > 0} p^{D,S}_i.
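To see concretely why 1/p* can be exponentially large, consider a conjunctive cascade (e.g., a routing path): arm i in the chosen sequence is revealed only if all earlier arms come up 1, so its triggering probability is the product of the earlier means. The short computation below is our own illustration, not from the paper:

```python
from math import prod

def trigger_probs_conjunctive(mu):
    """Triggering probabilities along a conjunctive cascade (illustrative).

    Arm i of the chosen sequence is revealed only if every earlier arm
    came up 1, so p_i = mu_1 * ... * mu_{i-1} (and p_1 = 1).
    """
    return [prod(mu[:i]) for i in range(len(mu))]

mu = [0.5] * 20                   # a length-20 routing path, each edge up w.p. 1/2
p = trigger_probs_conjunctive(mu)
p_star = min(p)                   # p* = smallest positive triggering probability
# p_star = 0.5**19, so a 1/p* factor in a regret bound exceeds half a million.
```

This is exactly the regime where a regret bound carrying 1/p* becomes vacuous, even though the hard-to-trigger arms barely affect the expected reward.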
In this section, we further use 1-norm based conditions instead of the infinity-norm based condition in [7], since they lead to better regret bounds for the influence maximization and cascading bandits.\n\n
Condition 2 (1-Norm TPM Bounded Smoothness). We say that a CMAB-T problem instance satisfies 1-norm TPM bounded smoothness if there exists B ∈ R+ (referred to as the bounded smoothness constant) such that, for any two distributions D, D' ∈ D with expectation vectors µ and µ', and any action S, we have |r_S(µ) − r_S(µ')| ≤ B Σ_{i ∈ [m]} p^{D,S}_i |µ_i − µ'_i|.\n\n
Note that the corresponding non-TPM version of the above condition would remove p^{D,S}_i from the condition, which is a generalization of the linear condition used in linear bandits [15]. Thus, the TPM version is clearly stronger than the non-TPM version (when the bounded smoothness constants are the same). The intuition of incorporating the triggering probability p^{D,S}_i to modulate the 1-norm condition is that, when an arm i is unlikely to be triggered by action S (small p^{D,S}_i), the importance of arm i also diminishes, in that a large change in µ_i only causes a small change in the expected reward r_S(µ). This property sounds natural in many applications, and it is important for bandit learning: although an arm i may be difficult to observe when playing S, it is also not important to the expected reward of S and thus does not need to be learned as accurately as other arms more easily triggered by S.\n\n
4 CUCB Algorithm and Regret Bound with TPM Bounded Smoothness\n\n
We use the same CUCB algorithm as in [7] (Algorithm 1). The algorithm maintains an empirical estimate µ̂_i for each true mean µ_i, and feeds the upper confidence bound µ̄_i to the offline oracle to obtain the next action S to play.
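Algorithm 1 can be transcribed into a short Python sketch. The oracle and the environment interaction below are caller-supplied placeholders, and the toy check at the end is our own illustration rather than an experiment from the paper:

```python
import math
import random

def cucb(m, oracle, play, T):
    """Sketch of CUCB (Algorithm 1), assuming caller-provided oracle and environment.

    oracle : maps the UCB vector to an action (stands in for the offline oracle).
    play   : maps an action to semi-bandit feedback {i: X_i} for triggered arms i.
    """
    T_i = [0] * m          # number of times each arm has been observed
    mu_hat = [1.0] * m     # empirical means, optimistically initialized to 1
    for t in range(1, T + 1):
        rho = [math.sqrt(3 * math.log(t) / (2 * T_i[i])) if T_i[i] > 0 else math.inf
               for i in range(m)]                                  # confidence radii
        mu_bar = [min(mu_hat[i] + rho[i], 1.0) for i in range(m)]  # UCBs, capped at 1
        S = oracle(mu_bar)
        for i, x in play(S).items():                # observe outcomes of triggered arms
            T_i[i] += 1
            mu_hat[i] += (x - mu_hat[i]) / T_i[i]   # incremental empirical-mean update
    return mu_hat, T_i

# Toy check: 2 Bernoulli arms, and the 'action' happens to observe both arms each round.
rng = random.Random(7)
mu_true = [0.2, 0.9]
oracle = lambda ucb: max(range(2), key=lambda i: ucb[i])
play = lambda S: {i: int(rng.random() < mu_true[i]) for i in range(2)}
mu_hat, counts = cucb(2, oracle, play, 2000)
```

The optimistic initialization and the unbounded radius for unobserved arms are what force every arm to be explored at least once.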
The upper confidence bound µ̄_i is large if arm i is not triggered often (T_i is small), providing optimistic estimates for less observed arms. We next provide its regret bound.\n\n
Definition 2 (Gap). Fix a distribution D and its expectation vector µ. For each action S, we define the gap ∆_S = max(0, α · opt_µ − r_S(µ)). For each arm i, we define\n\n
∆^i_min = inf_{S ∈ S: p^{D,S}_i > 0, ∆_S > 0} ∆_S,   ∆^i_max = sup_{S ∈ S: p^{D,S}_i > 0, ∆_S > 0} ∆_S.\n\n
As a convention, if there is no action S such that p^{D,S}_i > 0 and ∆_S > 0, we define ∆^i_min = +∞ and ∆^i_max = 0. We define ∆_min = min_{i ∈ [m]} ∆^i_min, and ∆_max = max_{i ∈ [m]} ∆^i_max. Let S̃ = {i ∈ [m] | p^{D,S}_i > 0} be the set of arms that could be triggered by S, and let K = max_{S ∈ S} |S̃|. For convenience, we use ⌈x⌉_0 to denote max{⌈x⌉, 0} for any real number x.\n\n
Theorem 1. For the CUCB algorithm on a CMAB-T problem instance that satisfies monotonicity (Condition 1) and 1-norm TPM bounded smoothness (Condition 2) with bounded smoothness constant B, (1) if ∆_min > 0, we have the distribution-dependent bound\n\n
Reg_{µ,α,β}(T) ≤ Σ_{i ∈ [m]} 576 B^2 K ln T / ∆^i_min + (Σ_{i ∈ [m]} (⌈log_2(2BK/∆^i_min)⌉_0 + 2)) · (π^2/6) · ∆_max + 4Bm;   (1)\n\n
(2) we have the distribution-independent bound\n\n
Reg_{µ,α,β}(T) ≤ 12B sqrt(mKT ln T) + (⌈log_2(T/(18 ln T))⌉_0 + 2) · m · (π^2/6) · ∆_max + 2Bm.   (2)\n\n
For the above theorem, we remark that the regret bounds are tight (up to a O(sqrt(log T)) factor in the case of the distribution-independent bound) based on a lower bound result in [15]. More specifically, Kveton et al. [15] show that for linear bandits (a special class of CMAB-T without probabilistic triggering), the distribution-dependent regret is lower bounded by Ω((m − K)K log T / ∆), and the distribution-independent regret is lower bounded by Ω(sqrt(mKT)) when T ≥ m/K, for some instance where ∆^i_min = ∆ for all i ∈ [m] and ∆^i_min < ∞. Comparing with our regret upper bound in the above theorem, (a) for the distribution-dependent bound, we have the regret upper bound O((m − K)K log T / ∆) since for that instance B = 1 and there are K arms with ∆^i_min = ∞, so tight with the lower bound in [15]; and (b) for the distribution-independent bound, we have the regret upper bound O(sqrt(mKT log T)), tight to the lower bound up to a O(sqrt(log T)) factor, same as the upper bound for the linear bandits in [15]. This indicates that the parameters m and K appearing in the above regret bounds are all needed. As for the parameter B, we can view it simply as a scaling parameter: if we scale the reward of an instance to B times larger than before, certainly the regret is B times larger. Looking at the distribution-dependent regret bound (Eq. (1)), ∆^i_min would also be scaled by a factor of B, canceling one B factor from B^2, and ∆_max is also scaled by a factor of B, and thus the regret bound in Eq. (1) is also scaled by a factor of B. In the distribution-independent regret bound (Eq. (2)), the scaling of B is more direct. Therefore, we can see that all parameters m, K, and B appearing in the above regret bounds are needed. Finally, we remark that the TPM Condition 2 can be refined such that B is replaced by an arm-dependent B_i that is moved inside the summation, and B in Theorem 1 is replaced with B_i accordingly. See the supplementary material for details.\n\n
4.1 Novel Ideas in the Regret Analysis\n\n
Due to the space limit, the full proof of Theorem 1 is moved to the supplementary material. Here we briefly explain the novel aspects of our analysis that allow us to achieve the new regret bounds and differentiate us from previous analyses such as the ones in [7] and [16, 15].\n\n
We first give an intuitive explanation of how to incorporate the TPM bounded smoothness condition to remove the factor 1/p* in the regret bound. Consider a simple illustrative example of two actions S_0 and S, where S_0 has a fixed reward r_0 as a reference action, and S has a stochastic reward depending on the outcomes of its triggered base arms. Let S̃ be the set of arms that can be triggered by S. For i ∈ S̃, suppose i can be triggered by action S with probability p^S_i, its true mean is µ_i, and its empirical mean at the end of round t is µ̂_{i,t}. The analysis in [7] would need the property that, if |µ̂_{i,t} − µ_i| ≤ δ_i for all i ∈ S̃ for some properly defined δ_i, then S no longer generates regret.
The analysis would conclude that arm i needs to be triggered Θ(log T / δ^2_i) times for the above condition to happen. Since arm i is only triggered with probability p^S_i, this means action S may need to be played Θ(log T / (p^S_i δ^2_i)) times. This is the essential reason why the factor 1/p* appears in the regret bound.\n\n
Now with the TPM bounded smoothness, we know that the impact of |µ̂_{i,t} − µ_i| ≤ δ_i on the difference in the expected reward is only p^S_i δ_i, or equivalently, we could relax the requirement to |µ̂_{i,t} − µ_i| ≤ δ_i / p^S_i to achieve the same effect as in the previous analysis. This translates to the result that action S would generate regret in at most O(log T / (p^S_i (δ_i / p^S_i)^2)) = O(p^S_i log T / δ^2_i) rounds.\n\n
We then need to handle the case when we have multiple actions that could trigger arm i. The simple addition of Σ_{S: p^S_i > 0} p^S_i log T / δ^2_i is not feasible since we may have exponentially or even infinitely many such actions. Instead, we introduce the key idea of triggering probability groups, such that the above actions are divided into groups by putting their triggering probabilities p^S_i into geometrically separated bins: (1/2, 1], (1/4, 1/2], . . . , (2^{−j}, 2^{−j+1}], . . . The actions in the same group would generate regret in at most O(2^{−j+1} log T / δ^2_i) rounds by a similar argument, and summing up together, they could generate regret in at most O(Σ_j 2^{−j+1} log T / δ^2_i) = O(log T / δ^2_i) rounds. Therefore, the factor of 1/p^S_i or 1/p* is completely removed from the regret bound.\n\n
Next, we briefly explain our idea to achieve the improved bound over the linear bandit result in [15]. The key step is to bound the regret ∆_{S_t} generated in round t. By a derivation similar to [15, 7] together with the 1-norm TPM bounded smoothness condition, we would obtain that ∆_{S_t} ≤ B Σ_{i ∈ S̃_t} p^{D,S_t}_i (µ̄_{i,t} − µ_i) with high probability. The analysis in [15] would analyze the errors |µ̄_{i,t} − µ_i| by a cascade of infinitely many sub-cases of whether there are x_j arms with errors larger than y_j with decreasing y_j, but it may still be loose. Instead, we directly work on the above summation. Naively bounding the above error summation would not give a O(log T) bound because there could be too many arms with small errors. Our trick is to use a reverse amortization: we accumulate small errors on many sufficiently sampled arms and treat them as errors of insufficiently sampled arms, such that an arm sampled O(log T) times would not contribute toward the regret. This trick tightens our analysis and leads to significantly improved constant factors.\n\n
4.2 Applications to Influence Maximization and Combinatorial Cascading Bandits\n\n
The following two lemmas show that both the cascading bandits and the influence maximization bandit satisfy the TPM condition.\n\n
Lemma 1. For both disjunctive and conjunctive cascading bandit problem instances, 1-norm TPM bounded smoothness (Condition 2) holds with bounded smoothness constant B = 1.\n\n
Lemma 2. For influence maximization bandit problem instances, 1-norm TPM bounded smoothness (Condition 2) holds with bounded smoothness constant B = C̃, where C̃ is the largest number of nodes any node can reach in the directed graph G = (V, E).\n\n
The proof of Lemma 1 involves a technique called bottom-up modification. Each action in cascading bandits can be viewed as a chain from top to bottom. When changing the means of arms below, the triggering probability of arms above is not changed.
Thus, if we change μ to μ′ backwards, the triggering probability of each arm is unaffected before its expectation is changed, and when changing the mean of an arm i, the expected reward of the action is changed by at most p^{D,S}_i |μ′_i − μ_i|.

The proof of Lemma 2 is more complex, since the bottom-up modification does not work directly on graphs with cycles. To circumvent this problem, we develop an influence tree decomposition technique as follows. First, we order all influence paths from the seed set S to a target node v. Second, each edge is independently sampled based on its edge probability to form a random live-edge graph. Third, we divide the reward portion of activating v among all paths from S to v: for each live-edge graph L in which v is reachable from S, assign the probability of L to the first path from S to v in L according to the path total order. Finally, we compose all the paths from S to v into a tree with S as the root and copies of v as the leaves, so that we can do bottom-up modification on this tree and properly trace the reward changes based on the reward division we made among the paths.

4.3 Discussions and Comparisons

We now discuss the implications of Theorem 1 together with Lemmas 1 and 2 by comparing them with several existing results.

Comparison with [7] and CMAB with ∞-norm bounded smoothness conditions. Our work is a direct adaptation of the study in [7]. Comparing with [7], we see that the regret bounds in Theorem 1 do not depend on the inverse of triggering probabilities, which is the main issue in [7]. When applied to the influence maximization bandit, our result is strictly stronger than that of [7] in two aspects: (a) we remove the factor of 1/p* by using the TPM condition; (b) we reduce a factor of |E| and √|E| in the dominant terms of the distribution-dependent and -independent bounds, respectively, due to our use of 1-norm instead of the ∞-norm conditions used in Chen et al. [7]. In the supplementary material, we further provide the corresponding ∞-norm TPM bounded smoothness conditions and the regret bound results, since in general the two sets of results do not imply each other.

Comparison with [25] on influence maximization bandits. Conceptually, our work deals with the general CMAB-T framework with influence maximization and combinatorial cascading bandits as applications, while Wen et al. [25] only work on the influence maximization bandit. Wen et al. [25] further study a generalization to linear transformations of edge probabilities, which is orthogonal to our current study and could potentially be incorporated into the general CMAB-T framework. Technically, both studies eliminate the exponential factor 1/p* in the regret bound. Comparing the remaining terms in the regret bounds, our regret bound depends on a topology-dependent term C̃ (Lemma 2), while their bound depends on a complicated term C*, which is related to both topology and edge probabilities. Although in general it is hard to compare the regret bounds, for the several graph families for which Wen et al. [25] provide concrete topology-dependent regret bounds, our bounds are always better, by a factor ranging from O(√k) to O(|V|), where k is the number of seeds selected in each round and V is the node set of the graph. This indicates that, in terms of characterizing the effect of topology on the regret bound, our simple complexity term C̃ is more effective than their complicated term C*.
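For reference, the two kinds of smoothness conditions being contrasted can be written side by side. The ∞-norm form below is the simplified condition of [7] as quoted in Section 5; the 1-norm TPM form is a paraphrase of Condition 2 in the text's notation, with S̃ denoting the set of arms that could possibly be triggered by S and p^{D,S}_i the probability that playing S triggers arm i:

```latex
% \infty-norm bounded smoothness (simplified condition of [7]):
|r_S(\mu) - r_S(\mu')| \;\le\; B \max_{i \in \tilde{S}} |\mu_i - \mu'_i|

% 1-norm TPM bounded smoothness (Condition 2, paraphrased):
|r_S(\mu) - r_S(\mu')| \;\le\; B \sum_{i \in \tilde{S}} p^{D,S}_i \, |\mu_i - \mu'_i|
```

The modulation factor p^{D,S}_i in the second inequality discounts the contribution of arms that are hard to trigger, which is what allows the 1/p* factor to be removed from the regret analysis.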
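To make the complexity term C̃ of Lemma 2 concrete, it can be computed by a reachability search from every node of the directed graph. The following is a minimal sketch; the graph encoding, the function name `tilde_c`, and the convention that a node counts as reaching itself are our own illustration choices, not specified in the paper:

```python
def tilde_c(nodes, edges):
    """Largest number of nodes any single node can reach by directed paths.

    `nodes` is an iterable of node ids; `edges` is a list of (u, v) pairs.
    Convention assumed here: a node reaches itself.
    """
    # Build an adjacency list for the directed graph.
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)

    best = 0
    for s in adj:
        # Iterative DFS collecting everything reachable from s.
        seen = {s}
        stack = [s]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, len(seen))
    return best
```

For a chain a → b → c this gives C̃ = 3, and for a star with one center pointing to three leaves it gives 4; by Lemma 2, this quantity is the 1-norm TPM smoothness constant B for the influence maximization bandit.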
See the supplementary material for the detailed table of comparison.

Comparison with [16] on combinatorial cascading bandits. By Lemma 1, we can apply Theorem 1 to combinatorial conjunctive and disjunctive cascading bandits with bounded smoothness constant B = 1, achieving O(Σ_i (1/Δ_i^min) K log T) distribution-dependent and O(√(mKT log T)) distribution-independent regret. In contrast, besides having exactly these terms, the results in [16] have an extra factor of 1/f*, where f* = Π_{i∈S*} p(i) for conjunctive cascades and f* = Π_{i∈S*} (1 − p(i)) for disjunctive cascades, with S* being the optimal solution and p(i) being the probability of success for item (arm) i. For conjunctive cascades, f* could be reasonably close to 1 in practice, as argued in [16], but for disjunctive cascades, f* could be exponentially small, since items in optimal solutions typically have large p(i) values. Therefore, our result completely removes the dependency on 1/f* and is better than their result. Moreover, we also have much smaller constant factors, owing to the new reverse amortization method described in Section 4.1.

Comparison with [15] on linear bandits. When there are no probabilistically triggered arms (i.e. p* = 1), Theorem 1 would have tighter bounds, since some of the analysis dealing with probabilistic triggering is not needed. In particular, in Eq. (1) the leading constant 624 would be reduced to 48, the ⌈log₂ x⌉₀ term is gone, and 6Bm becomes 2Bm; in Eq. (2) the leading constant 50 is reduced to 14, and the other changes are the same as above (see the supplementary material).
The result itself is also a new contribution, since it generalizes the linear bandit of [15] to general 1-norm conditions with matching regret bounds, while significantly reducing the leading constants (their constants are 534 and 47 for the distribution-dependent and -independent bounds, respectively). This improvement comes from the new reverse amortization method described in Section 4.1.

5 Lower Bound of the General CMAB-T Model

In this section, we show that there exist CMAB-T problem instances for which the regret bound in [7] is tight, i.e. the factor 1/p* in the distribution-dependent bound and √(1/p*) in the distribution-independent bound are unavoidable, where p* is the minimum positive probability that any base arm i is triggered by any action S. This also implies that the TPM bounded smoothness condition may not apply to all CMAB-T instances.

For our purpose, we only need a simplified version of the bounded smoothness condition of [7], as follows: there exists a bounded smoothness constant B such that, for every action S and every pair of mean outcome vectors μ and μ′, we have |r_S(μ) − r_S(μ′)| ≤ B max_{i∈S̃} |μ_i − μ′_i|, where S̃ is the set of arms that could possibly be triggered by S.

We prove the lower bounds using the following CMAB-T problem instance ([m], S, D, D_trig, R). For each base arm i ∈ [m], we define an action S_i, with the set of actions S = {S_1, . . . , S_m}. The family of distributions D consists of the distributions generated by every μ ∈ [0, 1]^m such that the arms are independent Bernoulli variables. When playing action S_i in round t, with a fixed probability p, arm i is triggered, its outcome X_i^(t) is observed, and the reward of playing S_i is p^{−1}X_i^(t); otherwise, with probability 1 − p, no arm is triggered, no feedback is observed, and the reward is 0. Following the CMAB-T framework, this means that D_trig(S_i, X), as a distribution on the subsets of [m], is either {i} with probability p or ∅ with probability 1 − p, and the reward R(S_i, X, τ) = p^{−1}X_i · I{τ = {i}}. The expected reward is r_{S_i}(μ) = μ_i, so this instance satisfies the above bounded smoothness condition with constant B = 1. We denote the above instance as FTP(p), standing for fixed triggering probability instance. This instance is similar to the position-based model [17] with only one position, but the feedback is different.

For the FTP(p) instance, we have p* = p and r_{S_i}(μ) = p · p^{−1}μ_i = μ_i. Then applying the result in [7], we have the distribution-dependent upper bound O(Σ_i (1/(pΔ_i^min)) log T) and the distribution-independent upper bound O(√(p^{−1}mT log T)).

We first provide the distribution-independent lower bound result.

Theorem 2. Let p be a real number with 0 < p < 1. Then for any CMAB-T algorithm A, if T ≥ 6p^{−1}, there exists a CMAB-T environment instance D with mean μ such that, on instance FTP(p),

    Reg^A_μ(T) ≥ (1/170) √(mT/p).

The proofs of the above theorem and the next one are both based on results for the classical MAB problems. Comparing to the upper bound O(√(p^{−1}mT log T)) obtained from [7], Theorem 2 implies that the regret upper bound of CUCB in [7] is tight up to a O(√(log T)) factor. This means that the 1/p* factor in the regret bound of [7] cannot be avoided in the general class of CMAB-T problems.

Next we give the distribution-dependent lower bound. For a learning algorithm, we say that it is consistent if, for every μ, every non-optimal arm is played o(T^a) times in expectation, for any real number a > 0. Then we have the following distribution-dependent lower bound.

Theorem 3. For any consistent algorithm A running on instance FTP(p) with μ_i < 1 for every arm i, we have

    lim inf_{T→+∞} Reg^A_μ(T)/ln T ≥ Σ_{i: μ_i < μ*} p^{−1}Δ_i / kl(μ_i, μ*),

where μ* = max_i μ_i, Δ_i = μ* − μ_i, and kl(·,·) is the Kullback-Leibler divergence function.

Again we see that the distribution-dependent upper bound obtained from [7] asymptotically matches the lower bound above. Finally, we remark that even if we rescale the reward from [0, 1/p] back to [0, 1], the corresponding scaling factor B would become p, and thus we would still obtain the conclusion that the regret bounds in [7] are tight (up to a O(√(log T)) factor), and thus 1/p* is in general needed in those bounds.

6 Conclusion and Future Work

In this paper, we propose the TPM bounded smoothness condition, which conveys the intuition that an arm that is difficult to trigger is also less important in determining the optimal solution. We show that this condition is essential for guaranteeing low regret, and prove that important applications, such as influence maximization bandits and combinatorial cascading bandits, all satisfy this condition.

There are several directions one may further pursue. One is to improve the regret bound for some specific problems.
For example, for the influence maximization bandit, can we give a better algorithm or analysis to achieve a better regret bound than the one provided by the general TPM condition? Another direction is to look into other applications with probabilistically triggered arms that may not satisfy the TPM condition or need other conditions to guarantee low regret. Combining the current CMAB-T framework with the linear generalization as in [25] to achieve scalable learning results is also an interesting direction.

Acknowledgment

Wei Chen is partially supported by the National Natural Science Foundation of China (Grant No. 61433014).

References

[1] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, 2002.

[3] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, 1985.

[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[5] Wei Chen, Laks V. S. Lakshmanan, and Carlos Castillo. Information and Influence Propagation in Social Networks. Morgan & Claypool Publishers, 2013.

[6] Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, and Pinyan Lu. Combinatorial multi-armed bandit with general reward functions. In NIPS, 2016.

[7] Wei Chen, Yajun Wang, Yang Yuan, and Qinshi Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33, 2016.
A preliminary version appeared as Chen, Wang, and Yuan, "Combinatorial multi-armed bandit: General framework, results and applications", ICML'2013.

[8] Richard Combes, M. Sadegh Talebi, Alexandre Proutiere, and Marc Lelarge. Combinatorial bandits revisited. In NIPS, 2015.

[9] Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20, 2012.

[10] Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

[11] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[12] David Kempe, Jon M. Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 137–146, 2003.

[13] Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.

[14] Branislav Kveton, Csaba Szepesvári, Zheng Wen, and Azin Ashkan. Cascading bandits: learning to rank in the cascade model. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[15] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[16] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári.
Combinatorial cascading bandits. In Advances in Neural Information Processing Systems, 2015.

[17] Paul Lagrée, Claire Vernade, and Olivier Cappé. Multiple-play bandits in the position-based model. In Advances in Neural Information Processing Systems, pages 1597–1605, 2016.

[18] Siyu Lei, Silviu Maniu, Luyi Mo, Reynold Cheng, and Pierre Senellart. Online influence maximization. In KDD, 2015.

[19] Michael Mitzenmacher and Eli Upfal. Probability and Computing. Cambridge University Press, 2005.

[20] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527–535, 1952.

[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[22] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[23] Sharan Vaswani, Laks V. S. Lakshmanan, and Mark Schmidt. Influence maximization with bandits. In NIPS Workshop on Networks in the Social and Information Sciences, 2015.

[24] Sharan Vaswani, Branislav Kveton, Zheng Wen, Mohammad Ghavamzadeh, Laks V. S. Lakshmanan, and Mark Schmidt. Diffusion independent semi-bandit influence maximization. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017. To appear.

[25] Zheng Wen, Branislav Kveton, and Michal Valko. Influence maximization with semi-bandit feedback. CoRR, abs/1605.06593v1, 2016.

[26] Yu Yang, Xiangbo Mao, Jian Pei, and Xiaofei He.
Continuous influence maximization: What discounts should we offer to social network users? In Proceedings of the 2016 International Conference on Management of Data (SIGMOD), 2016.