{"title": "Efficient learning by implicit exploration in bandit problems with side observations", "book": "Advances in Neural Information Processing Systems", "page_first": 613, "page_last": 621, "abstract": "We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot always be computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. 
Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.", "full_text": "Efficient learning by implicit exploration in bandit problems with side observations\n\nTomáš Kocák, Gergely Neu, Michal Valko, Rémi Munos*\n{tomas.kocak,gergely.neu,michal.valko,remi.munos}@inria.fr\nSequeL team, INRIA Lille – Nord Europe, France\n\nAbstract\n\nWe consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot always be computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. 
Both algorithms rely on a novel\nexploration strategy called implicit exploration, which is shown to be more ef\ufb01-\ncient both computationally and information-theoretically than previously studied\nexploration strategies for the problem.\n\n1\n\nIntroduction\n\nConsider the problem of sequentially recommending content for a set of users. In each period of\nthis online decision problem, we have to assign content from a news feed to each of our subscribers\nso as to maximize clickthrough. We assume that this assignment needs to be done well in advance,\nso that we only observe the actual content after the assignment was made and the user had the\nopportunity to click. While we can easily formalize the above problem in the classical multi-armed\nbandit framework [3], notice that we will be throwing out important information if we do so! The\nadditional information in this problem comes from the fact that several news feeds can refer to the\nsame content, giving us the opportunity to infer clickthroughs for a number of assignments that\nwe did not actually make. For example, consider the situation shown on Figure 1a. In this simple\nexample, we want to suggest one out of three news feeds to each user, that is, we want to choose a\nmatching on the graph shown on Figure 1a which covers the users. Assume that news feeds 2 and 3\nrefer to the same content, so whenever we assign news feed 2 or 3 to any of the users, we learn\nthe value of both of these assignments. The relations between these assignments can be described\nby a graph structure (shown on Figure 1b), where nodes represent user-news feed assignments, and\nedges mean that the corresponding assignments reveal the clickthroughs of each other. For a more\ncompact representation, we can group the nodes by the users, and rephrase our task as having to\nchoose one node from each group. 
Besides its own reward, each selected node reveals the rewards\nassigned to all their neighbors.\n\n\u2217Current af\ufb01liation: Google DeepMind\n\n1\n\n\fFigure 1a: Users and news feeds. The thick edges represent one\npotential matching of users to feeds, grouped news feeds show the\nsame content.\n\nFigure 1b: Users and news\nfeeds. Connected feeds mutually\nreveal each others clickthroughs.\n\nThe problem described above \ufb01ts into the framework of online combinatorial optimization where in\neach round, a learner selects one of a very large number of available actions so as to minimize the\nlosses associated with its sequence of decisions. Various instances of this problem have been widely\nstudied in recent years under different feedback assumptions [7, 2, 8], notably including the so-called\nfull-information [13] and semi-bandit [2, 16] settings. Using the example in Figure 1a, assuming full\ninformation means that clickthroughs are observable for all assignments, whereas assuming semi-\nbandit feedback, clickthroughs are only observable on the actually realized assignments. While\nit is unrealistic to assume full feedback in this setting, assuming semi-bandit feedback is far too\nrestrictive in our example. Similar situations arise in other practical problems such as packet routing\nin computer networks where we may have additional information on the delays in the network\nbesides the delays of our own packets.\nIn this paper, we generalize the partial observability model \ufb01rst proposed by Mannor and Shamir\n[15] and later revisited by Alon et al. [1] to accommodate the feedback settings situated between the\nfull-information and the semi-bandit schemes. 
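The feedback structure just described can be sketched in a few lines. The helper name and the tiny example graph below are ours, chosen to mirror the news-feed example (feeds 2 and 3 share content, so each reveals the other); they are not taken from the paper.

```python
def observed_losses(chosen, losses, out_edges):
    """Losses revealed when node `chosen` is played.

    `out_edges[i]` lists nodes j with an edge i -> j in the observability
    graph; playing i reveals its own loss and that of every out-neighbor.
    """
    revealed = {chosen: losses[chosen]}
    for j in out_edges.get(chosen, ()):
        revealed[j] = losses[j]
    return revealed

# Feeds 2 and 3 show the same content, so each reveals the other.
out_edges = {1: [], 2: [3], 3: [2]}
losses = {1: 0.7, 2: 0.2, 3: 0.2}
```

Playing feed 2 thus reveals the losses of both feeds 2 and 3, while playing feed 1 reveals only its own.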
Formally, we consider a sequential decision making\nproblem where in each time step t the (potentially adversarial) environment assigns a loss value to\neach out of d components, and generates an observation system whose role will be clari\ufb01ed soon.\nObliviously of the environment\u2019s choices, the learner chooses an action Vt from a \ufb01xed action\nset S \u2282 {0, 1}d represented by a binary vector with at most m nonzero components, and incurs\nthe sum of losses associated with the nonzero components of Vt. At the end of the round, the\nlearner observes the individual losses along the chosen components and some additional feedback\nbased on its action and the observation system. We represent this observation system by a directed\nobservability graph with d nodes, with an edge connecting i \u2192 j if and only if the loss associated\nwith j is revealed to the learner whenever Vt,i = 1. The goal of the learner is to minimize its total\nloss obtained over T repetitions of the above procedure. The two most well-studied variants of this\ngeneral framework are the multi-armed bandit problem [3] where each action consists of a single\ncomponent and the observability graph is a graph without edges, and the problem of prediction with\nexpert advice [17, 14, 5] where each action consists of exactly one component and the observability\ngraph is complete. In the true combinatorial setting where m > 1, the empty and complete graphs\ncorrespond to the semi-bandit and full-information settings respectively.\nOur model directly extends the model of Alon et al. [1], whose setup coincides with m = 1 in our\nframework. Alon et al. 
themselves were motivated by the work of Mannor and Shamir [15], who considered undirected observability systems where actions mutually uncover each other's losses. Mannor and Shamir proposed an algorithm based on linear programming that achieves a regret of $\tilde O(\sqrt{cT})$, where $c$ is the number of cliques into which the graph can be split. Later, Alon et al. [1] proposed an algorithm called EXP3-SET that guarantees a regret of $O(\sqrt{\alpha T \log d})$, where $\alpha$ is an upper bound on the independence numbers of the observability graphs assigned by the environment. In particular, this bound is tighter than the bound of Mannor and Shamir since $\alpha \le c$ for any graph. Furthermore, EXP3-SET is much more efficient than the algorithm of Mannor and Shamir as it only requires running the EXP3 algorithm of Auer et al. [3] on the decision set, which runs in time linear in $d$. Alon et al. [1] also extend the model of Mannor and Shamir in allowing the observability graph to be directed. For this setting, they offer another algorithm called EXP3-DOM with similar guarantees, although with the serious drawback that it requires access to the observation system before choosing its actions. This assumption poses severe limitations to the practical applicability of EXP3-DOM, which also needs to solve a sequence of set cover problems as a subroutine.\n\nIn the present paper, we offer two computationally and information-theoretically efficient algorithms for bandit problems with directed observation systems. Both of our algorithms circumvent the costly exploration phase required by EXP3-DOM by a trick that we will refer to as IX, as in Implicit eXploration. 
Accordingly, we name our algorithms EXP3-IX and FPL-IX, which are variants of the\nwell-known EXP3 [3] and FPL [12] algorithms enhanced with implicit exploration. Our \ufb01rst algo-\nrithm EXP3-IX is speci\ufb01cally designed1 to work in the setting of Alon et al. [1] with m = 1 and\ndoes not need to solve any set cover problems or have any sort of prior knowledge concerning the\nobservation systems chosen by the adversary.2 FPL-IX, on the other hand, does need either to solve\nset cover problems or have a prior upper bound on the independence numbers of the observability\ngraphs, but can be computed ef\ufb01ciently for a wide range of true combinatorial problems with m > 1.\nWe note that our algorithms do not even need to know the number of rounds T and our regret bounds\nscale with the average independence number \u00af\u03b1 of the graphs played by the adversary rather than the\nlargest of these numbers. They both employ adaptive learning rates and unlike EXP3-DOM, they\ndo not need to use a doubling trick to be anytime or to aggregate outputs of multiple algorithms to\noptimally set their learning rates. Both algorithms achieve regret guarantees of \u02dcO(m3/2\n\u00af\u03b1T ) in the\ncombinatorial setting, which becomes \u02dcO(\nBefore diving into the main content, we give an important graph-theoretic statement that we will\nrely on when analyzing both of our algorithms. The lemma is a generalized version of Lemma 13 of\nAlon et al. [1] and its proof is given in Appendix A.\nLemma 1. Let G be a directed graph with vertex set V = {1, . . . , d}. Let N\u2212\ni be the in-\nneighborhood of node i, i.e., the set of nodes j such that (j \u2192 i) \u2208 G. Let \u03b1 be the independence\n\n\u00af\u03b1T ) in the simple setting.\n\n\u221a\n\n\u221a\n\nnumber of G and p1,. . . 
, pd are numbers from [0, 1] such that $\sum_{i=1}^d p_i \le m$. Then\n\n$$\sum_{i=1}^{d} \frac{p_i}{\frac{1}{m}p_i + \frac{1}{m}P_i + c} \le 2m\alpha \log\left(1 + \frac{m\lceil d^2/c\rceil + d}{\alpha}\right) + 2m,$$\n\nwhere $P_i = \sum_{j \in N^-_i} p_j$ and $c$ is a positive constant.\n\n2 Multi-armed bandit problems with side observations\n\nIn this section, we start with the simplest setting fitting into our framework, namely the multi-armed bandit problem with side observations. We provide intuition about the implicit exploration procedure behind our algorithms and describe EXP3-IX, the most natural algorithm based on the IX trick.\nThe problem we consider is defined as follows. In each round $t = 1, 2, \dots, T$, the environment assigns a loss vector $\ell_t \in [0,1]^d$ for $d$ actions and also selects an observation system described by the directed graph $G_t$. Then, based on its previous observations (and likely some external source of randomness), the learner selects action $I_t$ and subsequently incurs and observes loss $\ell_{t,I_t}$. Furthermore, the learner also observes the losses $\ell_{t,j}$ for all $j$ such that $(I_t \to j) \in G_t$; we write $O_{t,i}$ for the indicator of the event that the loss of arm $i$ is observed in round $t$. Let $F_{t-1} = \sigma(I_{t-1}, \dots, I_1)$ capture the interaction history up to time $t$. As usual in online settings [6], the performance is measured in terms of the (total expected) regret, which is the difference between the total loss received and the total loss of the best single action chosen in hindsight,\n\n$$R_T = \max_{i \in [d]} \mathbb{E}\left[ \sum_{t=1}^{T} \left(\ell_{t,I_t} - \ell_{t,i}\right) \right],$$\n\nwhere the expectation integrates over the random choices made by the learning algorithm. Alon et al. [1] adapted the well-known EXP3 algorithm of Auer et al. [3] for this precise problem. 
Their algorithm, EXP3-DOM, works by maintaining a weight $w_{t,i}$ for each individual arm $i \in [d]$ in each round, and selecting $I_t$ according to the distribution\n\n$$\mathbb{P}\left[I_t = i \,\middle|\, F_{t-1}\right] = (1-\gamma) p_{t,i} + \gamma \mu_{t,i} = (1-\gamma)\frac{w_{t,i}}{\sum_{j=1}^d w_{t,j}} + \gamma \mu_{t,i},$$\n\nwhere $\gamma \in (0,1)$ is a parameter of the algorithm and $\mu_t$ is an exploration distribution whose role we will shortly clarify. After each round, EXP3-DOM defines the loss estimates\n\n$$\hat\ell_{t,i} = \frac{\ell_{t,i}}{o_{t,i}} \mathbb{1}_{\{(I_t \to i) \in G_t\}} \quad\text{where}\quad o_{t,i} = \mathbb{E}\left[O_{t,i} \,\middle|\, F_{t-1}\right] = \mathbb{P}\left[(I_t \to i) \in G_t \,\middle|\, F_{t-1}\right]$$\n\nfor each $i \in [d]$. These loss estimates are then used to update the weights for all $i$ as\n\n$$w_{t+1,i} = w_{t,i} e^{-\gamma \hat\ell_{t,i}}.$$\n\nIt is easy to see that these loss estimates $\hat\ell_{t,i}$ are unbiased estimates of the true losses whenever $p_{t,i} > 0$ holds for all $i$. This requirement, along with another important technical issue, justifies the presence of the exploration distribution $\mu_t$. The key idea behind EXP3-DOM is to compute a dominating set $D_t \subseteq [d]$ of the observability graph $G_t$ in each round, and define $\mu_t$ as the uniform distribution over $D_t$. This choice ensures that $o_{t,i} \ge p_{t,i} + \gamma/|D_t|$, a crucial requirement for the analysis of [1].\n\n¹EXP3-IX can also be efficiently implemented for some specific combinatorial decision sets even with $m > 1$; see, e.g., Cesa-Bianchi and Lugosi [7] for some examples.\n²However, it is still necessary to have access to the observability graph to construct low bias estimates of losses, but only after the action is selected. 
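One round of the EXP3-DOM scheme just described can be sketched as follows. This is our illustrative simplification, not the paper's implementation: the dominating set is taken as given rather than computed, and the helper name is ours.

```python
import math
import random

def exp3_dom_step(weights, gamma, dominating_set, losses, out_edges):
    """One simplified round of EXP3-DOM: sample from the exploration-mixed
    distribution, form importance-weighted loss estimates for observed arms,
    and update the weights multiplicatively."""
    d = len(weights)
    total = sum(weights)
    # Mix the exponential-weights distribution with uniform exploration on D_t.
    p = [(1 - gamma) * w / total
         + gamma * (i in dominating_set) / len(dominating_set)
         for i, w in enumerate(weights)]
    arm = random.choices(range(d), weights=p)[0]
    new_weights = list(weights)
    for i in range(d):
        if i == arm or i in out_edges.get(arm, ()):  # loss of i is revealed
            # o_i: probability that arm i's loss is observed under p.
            o_i = sum(p[j] for j in range(d)
                      if j == i or i in out_edges.get(j, ()))
            new_weights[i] = weights[i] * math.exp(-gamma * losses[i] / o_i)
    return p, arm, new_weights
```

With an edgeless graph only the played arm is observed, recovering the plain bandit update.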
In what follows, we propose an exploration scheme that does not need any fancy computations but, more importantly, works without any prior knowledge of the observability graphs.\n\n2.1 Efficient learning by implicit exploration\n\nIn this section, we propose the simplest exploration scheme imaginable, which consists of merely pretending to explore. Precisely, we simply sample our action $I_t$ from the distribution defined as\n\n$$\mathbb{P}\left[I_t = i \,\middle|\, F_{t-1}\right] = p_{t,i} = \frac{w_{t,i}}{\sum_{j=1}^d w_{t,j}}, \qquad (1)$$\n\nwithout explicitly mixing with any exploration distribution. Our key trick is to define the loss estimates for all arms $i$ as\n\n$$\hat\ell_{t,i} = \frac{\ell_{t,i}}{o_{t,i} + \gamma_t} \mathbb{1}_{\{(I_t \to i) \in G_t\}},$$\n\nwhere $\gamma_t > 0$ is a parameter of our algorithm. It is easy to check that $\hat\ell_{t,i}$ is a biased estimate of $\ell_{t,i}$. The nature of this bias, however, is very special. First, observe that $\hat\ell_{t,i}$ is an optimistic estimate of $\ell_{t,i}$ in the sense that $\mathbb{E}[\hat\ell_{t,i} \,|\, F_{t-1}] \le \ell_{t,i}$. That is, our bias always ensures that, in expectation, we underestimate the loss of any fixed arm $i$. Even more importantly, our loss estimates also satisfy\n\n$$\mathbb{E}\left[\sum_{i=1}^d p_{t,i}\hat\ell_{t,i} \,\middle|\, F_{t-1}\right] = \sum_{i=1}^d p_{t,i}\ell_{t,i} + \sum_{i=1}^d p_{t,i}\ell_{t,i}\left(\frac{o_{t,i}}{o_{t,i}+\gamma_t} - 1\right) = \sum_{i=1}^d p_{t,i}\ell_{t,i} - \gamma_t \sum_{i=1}^d \frac{p_{t,i}\ell_{t,i}}{o_{t,i}+\gamma_t}, \qquad (2)$$\n\nthat is, the bias of the estimated losses suffered by our algorithm is directly controlled by $\gamma_t$. As we will see in the analysis, it is sufficient to control the bias of our own estimated performance as long as we can guarantee that the loss estimates associated with any fixed arm are optimistic, which is precisely what we have. 
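The optimism of the IX estimate can be verified numerically: dividing by $o_{t,i} + \gamma_t$ instead of $o_{t,i}$ makes the estimate's mean fall below the true loss. The helper below and its toy sampling distribution are our illustration, not code from the paper.

```python
import random

def ix_estimate_mean(p, out_edges, gamma, target, loss, n=20000, seed=0):
    """Monte Carlo mean of the IX loss estimate for arm `target`.

    An arm is drawn from p; if it reveals `target` (itself or via an edge),
    the estimate loss / (o + gamma) is recorded, otherwise 0.
    """
    rng = random.Random(seed)
    d = len(p)
    # o: probability that target's loss is observed under p.
    o = sum(p[j] for j in range(d)
            if j == target or target in out_edges.get(j, ()))
    total = 0.0
    for _ in range(n):
        arm = rng.choices(range(d), weights=p)[0]
        if arm == target or target in out_edges.get(arm, ()):
            total += loss / (o + gamma)
    return total / n

# Arm 1 reveals arm 0, so o_0 = p_0 + p_1 = 0.8; true loss of arm 0 is 0.8.
mean_est = ix_estimate_mean([0.5, 0.3, 0.2], {1: [0]}, gamma=0.1,
                            target=0, loss=0.8)
```

Here the mean estimate concentrates near $0.8 \cdot 0.8/0.9 \approx 0.71$, strictly below the true loss of $0.8$, as the optimism property predicts.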
Note that this slight modification ensures that the denominator of $\hat\ell_{t,i}$ is lower bounded by $p_{t,i} + \gamma_t$, which is a very similar property to the one achieved by the exploration scheme used by EXP3-DOM. We call the above loss estimation method implicit exploration, or IX, as it gives rise to the same effect as explicit exploration without actually having to implement any exploration policy. In fact, explicit and implicit exploration can both be regarded as two different approaches to a bias-variance tradeoff: while explicit exploration biases the sampling distribution of $I_t$ to reduce the variance of the loss estimates, implicit exploration achieves the same result by biasing the loss estimates themselves.\nFrom this point on, we take a somewhat more predictable course and define our algorithm EXP3-IX as a variant of EXP3 using the IX loss estimates. One of the twists is that EXP3-IX is actually based on the adaptive learning-rate variant of EXP3 proposed by Auer et al. [4], which avoids the necessity of prior knowledge of the observability graphs in order to set a proper learning rate. This algorithm is defined by setting $\hat L_{t-1,i} = \sum_{s=1}^{t-1} \hat\ell_{s,i}$ and, for all $i \in [d]$, computing the weights as\n\n$$w_{t,i} = (1/d)\, e^{-\eta_t \hat L_{t-1,i}}.$$\n\nThese weights are then used to construct the sampling distribution of $I_t$ as defined in (1). The resulting EXP3-IX algorithm is shown as Algorithm 1.\n\n2.2 Performance guarantees for EXP3-IX\n\nOur analysis follows the footsteps of Auer et al. [3] and Györfi and Ottucsák [9], who provide an improved analysis of the adaptive learning-rate rule proposed by Auer et al. [4]. 
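A minimal sketch of one EXP3-IX round as just described: exponential weights on the cumulative IX estimates, sampling from (1), and the biased estimate update. The fixed learning rate and helper name here are our illustrative stand-ins for the paper's adaptive schedule.

```python
import math
import random

def exp3_ix_round(L_hat, eta, gamma, losses, out_edges, rng):
    """One round of an EXP3-IX-style update (simplified sketch).

    `L_hat[i]` accumulates the IX loss estimates; it is updated in place.
    """
    d = len(L_hat)
    base = min(L_hat)  # shift exponents for numerical stability
    w = [math.exp(-eta * (L_hat[i] - base)) for i in range(d)]
    W = sum(w)
    p = [wi / W for wi in w]
    arm = rng.choices(range(d), weights=p)[0]
    for i in range(d):
        if i == arm or i in out_edges.get(arm, ()):  # loss of i observed
            o_i = sum(p[j] for j in range(d)
                      if j == i or i in out_edges.get(j, ()))
            L_hat[i] += losses[i] / (o_i + gamma)  # IX estimate: no mixing
    return p, arm

rng = random.Random(1)
L_hat = [0.0, 0.0, 0.0]
p, arm = exp3_ix_round(L_hat, 0.1, 0.1, [0.3, 0.6, 0.9], {}, rng)
```

Note that, unlike the EXP3-DOM sketch, no exploration term is mixed into `p`; the bias lives entirely in the `o_i + gamma` denominator.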
However, a technical subtlety will force us to proceed a little differently than these standard proofs: to achieve the tightest possible bounds and the most efficient algorithm, we need to tune our learning rates according to some random quantities that depend on the performance of EXP3-IX. In fact, the key quantities in our analysis are the terms\n\n$$Q_t = \sum_{i=1}^d \frac{p_{t,i}}{o_{t,i} + \gamma_t},$$\n\nwhich depend on the interaction history $F_{t-1}$ for all $t$.\n\nAlgorithm 1 EXP3-IX\n1: Input: Set of actions $S = [d]$,\n2: parameters $\gamma_t \in (0,1)$, $\eta_t > 0$ for $t \in [T]$.\n3: for $t = 1$ to $T$ do\n4: An adversary privately chooses losses $\ell_{t,i}$ for $i \in [d]$ and generates a graph $G_t$\n5: $w_{t,i} \leftarrow (1/d) \exp(-\eta_t \hat L_{t-1,i})$ for $i \in [d]$\n6: $W_t \leftarrow \sum_{i=1}^d w_{t,i}$\n7: $p_{t,i} \leftarrow w_{t,i}/W_t$\n8: Choose $I_t \sim p_t = (p_{t,1}, \dots, p_{t,d})$\n9: Observe graph $G_t$\n10: Observe pairs $\{i, \ell_{t,i}\}$ for $(I_t \to i) \in G_t$\n11: $o_{t,i} \leftarrow \sum_{(j \to i) \in G_t} p_{t,j}$ for $i \in [d]$\n12: $\hat\ell_{t,i} \leftarrow \frac{\ell_{t,i}}{o_{t,i} + \gamma_t} \mathbb{1}_{\{(I_t \to i) \in G_t\}}$ for $i \in [d]$\n13: end for\n\nOur theorem below gives the performance guarantee for EXP3-IX using a parameter setting adaptive to the values of $Q_t$. A full proof of the theorem is given in the supplementary material.\nTheorem 1. 
Setting $\eta_t = \gamma_t = \sqrt{(\log d)/(d + \sum_{s=1}^{t-1} Q_s)}$, the regret of EXP3-IX satisfies\n\n$$R_T \le 4\,\mathbb{E}\left[\sqrt{\left(d + \sum_{t=1}^{T} Q_t\right)\log d}\,\right].$$\n\nProof sketch. Following the proof of Lemma 1 in Györfi and Ottucsák [9], we can prove that\n\n$$\sum_{i=1}^d p_{t,i}\hat\ell_{t,i} \le \frac{\eta_t}{2}\sum_{i=1}^d p_{t,i}\left(\hat\ell_{t,i}\right)^2 + \left(\frac{\log W_t}{\eta_t} - \frac{\log W_{t+1}}{\eta_{t+1}}\right). \qquad (3)$$\n\nTaking conditional expectations, using Equation (2) and summing up both sides, we get\n\n$$\sum_{t=1}^T \sum_{i=1}^d p_{t,i}\ell_{t,i} \le \sum_{t=1}^T \left(\left(\frac{\eta_t}{2} + \gamma_t\right) Q_t + \mathbb{E}\left[\frac{\log W_t}{\eta_t} - \frac{\log W_{t+1}}{\eta_{t+1}} \,\middle|\, F_{t-1}\right]\right). \qquad (4)$$\n\nUsing Lemma 3.5 of Auer et al. 
[4] and plugging in $\eta_t$ and $\gamma_t$, this becomes\n\n$$\sum_{t=1}^T \sum_{i=1}^d p_{t,i}\ell_{t,i} \le 3\sqrt{\left(d + \sum_{t=1}^T Q_t\right)\log d} + \sum_{t=1}^T \mathbb{E}\left[\frac{\log W_t}{\eta_t} - \frac{\log W_{t+1}}{\eta_{t+1}} \,\middle|\, F_{t-1}\right].$$\n\nTaking expectations on both sides, the second term on the right-hand side telescopes into\n\n$$\mathbb{E}\left[\sum_{t=1}^T \left(\frac{\log W_t}{\eta_t} - \frac{\log W_{t+1}}{\eta_{t+1}}\right)\right] = \mathbb{E}\left[\frac{\log W_1}{\eta_1} - \frac{\log W_{T+1}}{\eta_{T+1}}\right] \le \mathbb{E}\left[-\frac{\log w_{T+1,j}}{\eta_{T+1}}\right] \le \mathbb{E}\left[\frac{\log d}{\eta_{T+1}}\right] + \mathbb{E}\left[\hat L_{T,j}\right]$$\n\nfor any $j \in [d]$, giving the desired result as\n\n$$\sum_{t=1}^T \sum_{i=1}^d p_{t,i}\ell_{t,i} \le \sum_{t=1}^T \ell_{t,j} + 4\,\mathbb{E}\left[\sqrt{\left(d + \sum_{t=1}^T Q_t\right)\log d}\,\right],$$\n\nwhere we used the definition of $\eta_{T+1}$ and the optimistic property of the loss estimates.\n\nSetting $m = 1$ and $c = \gamma_t$ in Lemma 1 gives the following deterministic upper bound on each $Q_t$.\nLemma 2. For all $t \in [T]$,\n\n$$Q_t = \sum_{i=1}^d \frac{p_{t,i}}{o_{t,i} + \gamma_t} \le 2\alpha_t \log\left(1 + \frac{\lceil d^2/\gamma_t \rceil + d}{\alpha_t}\right) + 2.$$\n\nCombining Lemma 2 with Theorem 1, we prove our main result concerning the regret of EXP3-IX.\nCorollary 1. The regret of EXP3-IX satisfies\n\n$$R_T \le 4\sqrt{\left(d + 2\sum_{t=1}^T (H_t \alpha_t + 1)\right)\log d}, \quad\text{where}\quad H_t = \log\left(1 + \frac{\lceil d^2\sqrt{td/\log d}\,\rceil + d}{\alpha_t}\right) = O(\log(dT)).$$\n\n3 Combinatorial semi-bandit problems with side observations\n\nWe now turn our attention to the setting of online combinatorial optimization (see [13, 7, 2]). In this variant of the online learning problem, the learner has access to a possibly huge action set $S \subseteq \{0,1\}^d$ where each action is represented by a binary vector $v$ of dimensionality $d$. In what follows, we assume that $\|v\|_1 \le m$ holds for all $v \in S$ and some $1 \le m \ll d$, with the case $m = 1$ corresponding to the multi-armed bandit setting considered in the previous section. In each round $t = 1, 2, \dots, T$ of the decision process, the learner picks an action $V_t \in S$ and incurs a loss of $V_t^{\mathsf{T}}\ell_t$. At the end of the round, the learner receives some feedback based on its decision $V_t$ and the loss vector $\ell_t$. 
The regret of the learner is defined as\n\n$$R_T = \max_{v \in S} \mathbb{E}\left[\sum_{t=1}^T (V_t - v)^{\mathsf{T}} \ell_t\right].$$\n\nPrevious work has considered the following feedback schemes in the combinatorial setting:\n\n• The full information scheme, where the learner gets to observe $\ell_t$ regardless of the chosen action. The minimax optimal regret of order $m\sqrt{T \log d}$ here is achieved by the COMPONENTHEDGE algorithm of [13], while Follow-the-Perturbed-Leader (FPL) [12, 10] was shown to enjoy a regret of order $m^{3/2}\sqrt{T \log d}$ by [16].\n• The semi-bandit scheme, where the learner gets to observe the components $\ell_{t,i}$ of the loss vector where $V_{t,i} = 1$, that is, the losses along the components chosen by the learner at time $t$. As shown by [2], COMPONENTHEDGE achieves a near-optimal $O(\sqrt{mdT \log d})$ regret guarantee, while [16] show that FPL enjoys a bound of $O(m\sqrt{dT \log d})$.\n• The bandit scheme, where the learner only observes its own loss $V_t^{\mathsf{T}}\ell_t$. There are currently no known efficient algorithms that get close to the minimax regret in this setting; the reader is referred to Audibert et al. [2] for an overview of recent results.\n\nIn this section, we define a new feedback scheme situated between the semi-bandit and the full-information schemes. In particular, we assume that the learner gets to observe the losses of some other components not included in its own decision vector $V_t$. Similarly to the model of Alon et al. [1], the relation between the chosen action and the side observations is given by a directed observability graph $G_t$ (see example in Figure 1). We refer to this feedback scheme as semi-bandit with side observations. While our theoretical results stated in the previous section continue to hold in this setting, combinatorial EXP3-IX could rarely be implemented efficiently; we refer to [7, 13] for some positive examples. 
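The three feedback schemes above, plus the new semi-bandit-with-side-observations scheme, differ only in which loss components are revealed. A minimal comparison (helper name and example are ours):

```python
def observed_components(V, out_edges, scheme):
    """Indices whose losses are revealed under a given feedback scheme.

    V is a 0/1 action vector; out_edges encodes the directed observability
    graph (playing component i also reveals each j in out_edges[i]).
    """
    d = len(V)
    chosen = {i for i in range(d) if V[i] == 1}
    if scheme == "full":
        return set(range(d))
    if scheme == "semi-bandit":
        return chosen
    if scheme == "side-observations":
        extra = {j for i in chosen for j in out_edges.get(i, ())}
        return chosen | extra
    raise ValueError(f"unknown scheme: {scheme}")

V = [1, 0, 1, 0]
graph = {0: [1]}  # component 0 additionally reveals component 1
```

Semi-bandit reveals only the chosen components, side observations add the out-neighbors, and full information reveals everything; this is exactly the ordering of the schemes in the text.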
As one of the main concerns in this paper is computational efficiency, we take a different approach: we propose a variant of FPL that efficiently implements the idea of implicit exploration in combinatorial semi-bandit problems with side observations.\n\n3.1 Implicit exploration by geometric resampling\n\nIn each round $t$, FPL bases its decision on some estimate $\hat L_{t-1} = \sum_{s=1}^{t-1} \hat\ell_s$ of the total losses $L_{t-1} = \sum_{s=1}^{t-1} \ell_s$ as follows:\n\n$$V_t = \arg\min_{v \in S} v^{\mathsf{T}}\left(\eta_t \hat L_{t-1} - Z_t\right). \qquad (5)$$\n\nHere, $\eta_t > 0$ is a parameter of the algorithm and $Z_t$ is a perturbation vector with components drawn independently from an exponential distribution with unit expectation. The power of FPL lies in that it only requires an oracle that solves the (offline) optimization problem $\min_{v \in S} v^{\mathsf{T}}\ell$ and thus can be used to turn any efficient offline solver into an online optimization algorithm with strong guarantees. To define our algorithm precisely, we need some further notation. We redefine $F_{t-1}$ to be $\sigma(V_{t-1}, \dots, V_1)$, let $O_{t,i}$ be the indicator of the observed component, and let\n\n$$q_{t,i} = \mathbb{E}\left[V_{t,i} \,\middle|\, F_{t-1}\right] \quad\text{and}\quad o_{t,i} = \mathbb{E}\left[O_{t,i} \,\middle|\, F_{t-1}\right].$$\n\nThe most crucial point of our algorithm is the construction of our loss estimates. To implement the idea of implicit exploration by optimistic biasing, we apply a modified version of the geometric resampling method of Neu and Bartók [16], constructed as follows: let $O'_t(1), O'_t(2), \dots$ be independent copies³ of $O_t$ and let $U_{t,i}$ be geometrically distributed random variables for all $i \in [d]$ with parameter $\gamma_t$. 
We let\n\n$$K_{t,i} = \min\left(\left\{k : O'_{t,i}(k) = 1\right\} \cup \{U_{t,i}\}\right) \qquad (6)$$\n\nand define our loss-estimate vector $\hat\ell_t \in \mathbb{R}^d$ with its $i$-th element as\n\n$$\hat\ell_{t,i} = K_{t,i} O_{t,i} \ell_{t,i}. \qquad (7)$$\n\nBy definition, we have $\mathbb{E}[K_{t,i} \,|\, F_{t-1}] = 1/(o_{t,i} + (1 - o_{t,i})\gamma_t)$, implying that our loss estimates are optimistic in the sense that they lower bound the losses in expectation:\n\n$$\mathbb{E}\left[\hat\ell_{t,i} \,\middle|\, F_{t-1}\right] = \frac{o_{t,i}}{o_{t,i} + (1 - o_{t,i})\gamma_t}\,\ell_{t,i} \le \ell_{t,i}.$$\n\nHere we used the fact that $O_{t,i}$ is independent of $K_{t,i}$ and has expectation $o_{t,i}$ given $F_{t-1}$. We call this algorithm Follow-the-Perturbed-Leader with Implicit eXploration (FPL-IX, Algorithm 2).\nNote that the geometric resampling procedure can be terminated as soon as $K_{t,i}$ becomes well-defined for all $i$ with $O_{t,i} = 1$. As noted by Neu and Bartók [16], this requires generating at most $d$ copies of $O_t$ in expectation. As each of these copies requires one access to the linear optimization oracle over $S$, we conclude that the expected running time of FPL-IX is at most $d$ times the expected running time of the oracle. A high-probability guarantee on the running time can be obtained by observing that $U_{t,i} \le \log(1/\delta)/\gamma_t$ holds with probability at least $1 - \delta$, and thus we can stop sampling after at most $d\log(d/\delta)/\gamma_t$ steps with probability at least $1 - \delta$.\n\n3.2 Performance guarantees for FPL-IX\n\nThe analysis presented in this section combines some techniques used by Kalai and Vempala [12], Hutter and Poland [11], and Neu and Bartók [16] for analyzing FPL-style learners. Our proofs also heavily rely on some specific properties of the IX loss estimate defined in Equation (7). 
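The capped draw in Equation (6) can be sketched as follows; `sample_K` is our illustrative name. The Monte Carlo check relies on the identity $\mathbb{E}[K_{t,i} \,|\, F_{t-1}] = 1/(o_{t,i} + (1-o_{t,i})\gamma_t)$ stated above, where the success probability of each copy stands in for $o_{t,i}$.

```python
import random

def sample_K(o, gamma, rng):
    """Sample K = min({k : O'(k) = 1} ∪ {U}) with U ~ Geometric(gamma).

    Each independent copy O'(k) is a Bernoulli(o) draw standing in for
    re-running the FPL rule and checking whether component i is observed.
    """
    # U ~ Geometric(gamma) on {1, 2, ...}
    U = 1
    while rng.random() >= gamma:
        U += 1
    # First success among the copies, truncated at U.
    for k in range(1, U):
        if rng.random() < o:  # O'(k) = 1
            return k
    return U
```

Since $\mathbb{P}[K > k] = (1-o)^k(1-\gamma)^k$, K is itself geometric with parameter $o + (1-o)\gamma$, so its empirical mean should match $1/(o + (1-o)\gamma)$; for $o = 0.4$, $\gamma = 0.2$ this is $1/0.52 \approx 1.92$.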
The most important difference from the analysis presented in Section 2.2 is that now we are not able to use random learning rates, as we cannot compute the values corresponding to $Q_t$ efficiently. In fact, these values are observable in the information-theoretic sense, so we could prove bounds similar to Theorem 1 had we had access to infinite computational resources. As our focus in this paper is on computationally efficient algorithms, we choose to pursue a different path. In particular, our learning rates will be tuned according to efficiently computable approximations $\tilde\alpha_t$ of the respective independence numbers $\alpha_t$ that satisfy $\alpha_t/C \le \tilde\alpha_t \le \alpha_t \le d$ for some $C \ge 1$. For the sake of simplicity, we analyze the algorithm in the oblivious adversary model. The following theorem states the performance guarantee for FPL-IX in terms of the learning rates and random variables of the form\n\n$$\tilde Q_t(c) = \sum_{i=1}^d \frac{q_{t,i}}{o_{t,i} + c}.$$\n\nAlgorithm 2 FPL-IX\n1: Input: Set of actions $S$,\n2: parameters $\gamma_t \in (0,1)$, $\eta_t > 0$ for $t \in [T]$.\n3: for $t = 1$ to $T$ do\n4: An adversary privately chooses losses $\ell_{t,i}$ for all $i \in [d]$ and generates a graph $G_t$\n5: Draw $Z_{t,i} \sim \mathrm{Exp}(1)$ for all $i \in [d]$\n6: $V_t \leftarrow \arg\min_{v \in S} v^{\mathsf{T}}(\eta_t \hat L_{t-1} - Z_t)$\n7: Receive loss $V_t^{\mathsf{T}}\ell_t$\n8: Observe graph $G_t$\n9: Observe pairs $\{i, \ell_{t,i}\}$ for all $i$ such that $(j \to i) \in G_t$ and $V_{t,j} = 1$\n10: Compute $K_{t,i}$ for all $i \in [d]$ using Eq. (6)\n11: $\hat\ell_{t,i} \leftarrow K_{t,i} O_{t,i} \ell_{t,i}$\n12: end for\n\n³Such independent copies can be simply generated by sampling independent copies of $V_t$ using the FPL rule
Notice that this procedure requires no interaction\n\n(5) and then computing O(cid:48)\nbetween the learner and the environment, although each sample requires an oracle access.\n\n7\n\n\fTheorem 2. Assume \u03b3t \u2264 1/2 for all t and \u03b71 \u2265 \u03b72 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03b7T . The regret of FPL-IX satis\ufb01es\n\nRT \u2264 m (log d + 1)\n\n\u03b7T\n\n+ 4m\n\n\u03b7tE\n\nT(cid:88)\n\nt=1\n\n(cid:19)(cid:21)\n\n(cid:18) \u03b3t\n\n1 \u2212 \u03b3t\n\nT(cid:88)\n\nt=1\n\n+\n\n\u03b3tE(cid:104)(cid:101)Qt(\u03b3t)\n(cid:105)\n\n.\n\nProof sketch. As usual for analyzing FPL methods [12, 11, 16], we \ufb01rst de\ufb01ne a hypothetical learner\n\nthat uses a time-independent perturbation vector (cid:101)Z \u223c Z1 and has access to \u02c6(cid:96)t on top of (cid:98)Lt\u22121\n\n\u03b7t(cid:98)Lt \u2212 (cid:101)Z\n\n(cid:17)\n\n.\n\n.\n\nE\n\nt=1\n\n\u03b7T\n\nE(cid:104)\n\n\u2264 m (log d + 1)\n\nwhich we can further upper bounded after a long and tedious calculation as\n\nClearly, this learner is infeasible as it uses observations from the future. Also, observe that this\nlearner does not actually interact with the environment and depends on the predictions made by the\nactual learner only through the loss estimates. 
By standard arguments, we can prove that
$$\mathbb{E}\!\left[\sum_{t=1}^{T} \bigl(\widetilde{V}_t - v\bigr)^{\!\top} \hat{\ell}_t\right] \le \frac{m(\log d + 1)}{\eta_T}.$$
Using the techniques of Neu and Bartók [16], we can relate the performance of $V_t$ to that of $\widetilde{V}_t$, which we can further upper bound after a long and tedious calculation as
$$\mathbb{E}\!\left[\bigl(V_t - \widetilde{V}_t\bigr)^{\!\top} \hat{\ell}_t \,\Big|\, \mathcal{F}_{t-1}\right] \le \eta_t\, \mathbb{E}\!\left[\bigl(\widetilde{V}_{t-1}^\top \hat{\ell}_t\bigr)^2 \,\Big|\, \mathcal{F}_{t-1}\right] \le 4m\eta_t\, \mathbb{E}\!\left[\widetilde{Q}_t\!\left(\frac{\gamma_t}{1 - \gamma_t}\right) \Big|\, \mathcal{F}_{t-1}\right].$$
The result follows by observing that $\mathbb{E}\bigl[v^\top \hat{\ell}_t \,\big|\, \mathcal{F}_{t-1}\bigr] \le v^\top \ell_t$ for any fixed $v \in S$ by the optimistic property of the IX estimate, and also from the fact that, by the definition of the estimates,
$$\mathbb{E}\!\left[\widetilde{V}_t^\top \hat{\ell}_t \,\Big|\, \mathcal{F}_{t-1}\right] \ge \mathbb{E}\!\left[V_t^\top \ell_t \,\big|\, \mathcal{F}_{t-1}\right] - \gamma_t\, \mathbb{E}\!\left[\widetilde{Q}_t(\gamma_t) \,\Big|\, \mathcal{F}_{t-1}\right].$$

The next lemma shows a suitable upper bound for the last two terms in the bound of Theorem 2. It follows from observing that $o_{t,i} \ge (1/m) \sum_{j \in N^{-}_{t,i} \cup \{i\}} q_{t,j}$ and applying Lemma 1.

Lemma 3. For all $t \in [T]$ and any $c \in (0, 1)$,
$$\widetilde{Q}_t(c) = \sum_{i=1}^{d} \frac{q_{t,i}}{o_{t,i} + c} \le 2m\alpha_t \log\!\left(1 + \frac{m\lceil d^2/c \rceil + d}{\alpha_t}\right) + 2m.$$

We are now ready to state the main result of this section, which is obtained by combining Theorem 2, Lemma 3, and Lemma 3.5 of Auer et al. [4] applied to the following upper bound:
$$\sum_{t=1}^{T} \frac{\alpha_t}{\sqrt{d + \sum_{s=1}^{t-1} \widetilde{\alpha}_s}} \le \sum_{t=1}^{T} \frac{\alpha_t}{\sqrt{\sum_{s=1}^{t} \alpha_s / C}} \le 2\sqrt{C \sum_{t=1}^{T} \alpha_t} \le 2\sqrt{d + C \sum_{t=1}^{T} \alpha_t}.$$

Corollary 2. Assume that for all $t \in [T]$, $\alpha_t/C \le \widetilde{\alpha}_t \le \alpha_t \le d$ for some $C > 1$, and assume $md > 4$.
Setting
$$\eta_t = \gamma_t = \sqrt{\frac{\log d + 1}{m\bigl(d + \sum_{s=1}^{t-1} \widetilde{\alpha}_s\bigr)}},$$
the regret of FPL-IX satisfies
$$R_T \le H m^{3/2} \sqrt{\Bigl(d + C \sum_{t=1}^{T} \alpha_t\Bigr)(\log d + 1)},$$
where $H = O(\log(mdT))$.

Conclusion We presented an efficient algorithm for learning with side observations based on implicit exploration. This technique gave rise to a multitude of improvements. Remarkably, our algorithms no longer need to know the observation system before choosing the action, unlike the method of [1]. Moreover, we extended the partial observability model of [15, 1] to accommodate problems with large and structured action sets, and also gave an efficient algorithm for this setting.

Acknowledgements The research presented in this paper was supported by French Ministry of Higher Education and Research, by European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 270327 (CompLACS), and by FUI project Hermès.

References
[1] Alon, N., Cesa-Bianchi, N., Gentile, C., and Mansour, Y. (2013). From Bandits to Experts: A Tale of Domination and Independence. In Neural Information Processing Systems.
[2] Audibert, J.-Y., Bubeck, S., and Lugosi, G. (2014). Regret in Online Combinatorial Optimization. Mathematics of Operations Research, 39:31–45.
[3] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002a). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77.
[4] Auer, P., Cesa-Bianchi, N., and Gentile, C. (2002b).
Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64:48–75.
[5] Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D., Schapire, R., and Warmuth, M. (1997). How to use expert advice. Journal of the ACM, 44:427–485.
[6] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA.
[7] Cesa-Bianchi, N. and Lugosi, G. (2012). Combinatorial bandits. Journal of Computer and System Sciences, 78:1404–1422.
[8] Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial Multi-Armed Bandit: General Framework and Applications. In International Conference on Machine Learning, pages 151–159.
[9] Györfi, L. and Ottucsák, Gy. (2007). Sequential prediction of unbounded stationary time series. IEEE Transactions on Information Theory, 53(5):1866–1872.
[10] Hannan, J. (1957). Approximation to Bayes Risk in Repeated Play. Contributions to the Theory of Games, 3:97–139.
[11] Hutter, M. and Poland, J. (2004). Prediction with Expert Advice by Following the Perturbed Leader for General Weights. In Algorithmic Learning Theory, pages 279–293.
[12] Kalai, A. and Vempala, S. (2005). Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307.
[13] Koolen, W. M., Warmuth, M. K., and Kivinen, J. (2010). Hedging structured concepts. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), pages 93–105.
[14] Littlestone, N. and Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108:212–261.
[15] Mannor, S. and Shamir, O. (2011). From Bandits to Experts: On the Value of Side-Observations. In Neural Information Processing Systems.
[16] Neu, G. and Bartók, G. (2013).
An Efficient Algorithm for Learning with Semi-bandit Feedback. In Jain, S., Munos, R., Stephan, F., and Zeugmann, T., editors, Algorithmic Learning Theory, volume 8139 of Lecture Notes in Computer Science, pages 234–248. Springer Berlin Heidelberg.
[17] Vovk, V. (1990). Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory (COLT), pages 371–386.