{"title": "From Bandits to Experts: On the Value of Side-Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 684, "page_last": 692, "abstract": "We consider an adversarial online learning setting where a decision maker can choose an action in every stage of the game. In addition to observing the reward of the chosen action, the decision maker gets side observations on the reward he would have obtained had he chosen some of the other actions. The observation structure is encoded as a graph, where node i is linked to node j if sampling i provides information on the reward of j. This setting naturally interpolates between the well-known ``experts'' setting, where the decision maker can view all rewards, and the multi-armed bandits setting, where the decision maker can only view the reward of the chosen action. We develop practical algorithms with provable regret guarantees, which depend on non-trivial graph-theoretic properties of the information feedback structure. We also provide partially-matching lower bounds.", "full_text": "From Bandits to Experts: On the Value of\n\nSide-Observations\n\nShie Mannor\n\nTechnion, Israel\n\nOhad Shamir\n\nUSA\n\nDepartment of Electrical Engineering\n\nMicrosoft Research New England\n\nshie@ee.technion.ac.il\n\nohadsh@microsoft.com\n\nAbstract\n\nWe consider an adversarial online learning setting where a decision maker can\nchoose an action in every stage of the game. In addition to observing the reward\nof the chosen action, the decision maker gets side observations on the reward he\nwould have obtained had he chosen some of the other actions. The observation\nstructure is encoded as a graph, where node i is linked to node j if sampling i pro-\nvides information on the reward of j. 
This setting naturally interpolates between\nthe well-known \u201cexperts\u201d setting, where the decision maker can view all rewards,\nand the multi-armed bandits setting, where the decision maker can only view the\nreward of the chosen action. We develop practical algorithms with provable regret\nguarantees, which depend on non-trivial graph-theoretic properties of the\ninformation feedback structure. We also provide partially-matching lower bounds.\n\n1 Introduction\n\nOne of the most basic learning settings studied in the online learning framework is learning from\nexperts. In its simplest form, we assume that each round t, the learning algorithm must choose one\nof k possible actions, which can be interpreted as following the advice of one of k \u201cexperts\u201d1. At the\nend of the round, the performance of all actions, measured here in terms of some reward, is revealed.\nThis process is iterated for T rounds, and our goal is to minimize the regret, namely the difference\nbetween the total reward of the single best action in hindsight, and our own accumulated reward.\nWe follow the standard online learning framework, in which nothing whatsoever can be assumed\non the process generating the rewards, and they might even be chosen by an adversary who has full\nknowledge of our learning algorithm.\nA crucial assumption in this setting is that we get to see the rewards of all actions at the end of each\nround. However, in many real-world scenarios, this assumption is unrealistic. A canonical example\nis web advertising, where at any timepoint one may choose only a single ad (or small number of ads)\nto display, and observe whether it was clicked, but not whether other ads would have been clicked\nor not if presented to the user. This partial information constraint has led to a flourishing literature\non multi-armed bandits problems, which model the setting where we can only observe the reward\nof the action we chose. 
While this setting has been long studied under stochastic assumptions, the\nlandmark paper [4] showed that this setting can also be dealt with under adversarial conditions,\nmaking the setting comparable to the experts setting discussed above. The price in terms of the\nprovable regret is usually an extra \u221ak multiplicative factor in the bound. The intuition for this factor\nhas long been that in the bandit setting, we only get \u201c1/k of the information\u201d obtained in the expert\nsetting (as we observe just a single reward rather than k). While the bandits setting received much\ntheoretical interest, it has also been criticized for not capturing additional side-information we often\nhave on the rewards of the different actions. This has led to studying richer settings, which make\nvarious assumptions on the relationship between the rewards; see below for more details.\n\n1The more general setup, which is beyond the scope of this paper, considers k experts providing advice for\nchoosing among n actions, where in general n \u2260 k [4].\n\nIn this paper, we formalize and initiate a study on a range of settings that interpolates between the\nbandits setting and the experts setting. Intuitively, we assume that after choosing some action i, and\nobtaining the action\u2019s reward, we observe not just action i\u2019s reward (as in the bandit setting), and\nnot the rewards of all actions (as in the experts setting), but rather some (possibly noisy) information\non a subset of the other actions. This subset may depend on action i in an arbitrary way, and may\nchange from round to round. This information feedback structure can be modeled as a sequence\nof directed graphs G1, . . . , GT (one per round t), so that an edge from action i to action j implies\nthat by choosing action i, \u201csufficiently good\u201d information is revealed on the reward of action j as\nwell. The case of Gt being the complete graph corresponds to the experts setting. 
The case of Gt\nbeing the empty graph corresponds to the bandit setting. The broad scenario of arbitrary graphs in\nbetween the two is the focus of our study.\nAs a motivating example, consider the problem of web advertising mentioned earlier. In the standard\nmulti-armed bandits setting, we assume that we have no information whatsoever on whether\nundisplayed ads would have been clicked on. However, in many cases, we do have some side-information.\nFor instance, if two ads i, j are for similar vacation packages in Hawaii, and ad i was displayed and\nclicked on by some user, it is likely that the other ad j would have been clicked on as well. In\ncontrast, if ad i is for running shoes, and ad j is for wheelchair accessories, then a user who clicked\non one ad is unlikely to click on the other. This sort of side-information can be better captured in\nour setting.\nAs another motivating example, consider a sensor network where each sensor collects data from a\ncertain geographic location. Each sensor covers an area that may overlap the area covered by other\nsensors. At every stage a centralized controller activates one of the sensors and receives input from\nit. The value of this input is modeled as the integral of some \u201cinformation\u201d in the covered area.\nSince the area covered by each of the sensors overlaps the area covered by other sensors, the reward\nobtained when choosing sensor i provides an indication of the reward that would have been obtained\nwhen sampling sensor j. A related example comes from ultra wideband communication networks,\nwhere every agent can select which channel to use for transmission. When using a channel, the\nagent senses if the transmission was successful, and also receives some indication of the noise level\nin other channels that are in adjacent frequency bands [2].\nOur results portray an interesting picture, with the attainable regret depending on non-trivial\nproperties of these graphs. 
We provide two practical algorithms with regret guarantees: the ExpBan\nalgorithm that is based on a combination of existing methods, and the more fundamentally novel\nELP algorithm that has superior guarantees. We also study lower bounds for our setting. In the\ncase of undirected graphs, we show that the information-theoretically attainable regret is precisely\ncharacterized by the average independence number (or stability number) of the graph, namely the\nsize of its largest independent set. For the case of directed graphs, we obtain a weaker regret which\ndepends on the average clique-partition number of the graphs. More specifically, our contributions\nare as follows:\n\n\u2022 We formally define and initiate a study of the setting that interpolates between learning with\nexpert advice (with $O(\\sqrt{\\log(k)T})$ regret) that assumes that all rewards are revealed and\nthe multi-armed bandits setting (with $\\tilde{O}(\\sqrt{kT})$ regret) that assumes that only the reward of\nthe action selected is revealed. We provide an answer to a range of models in between.\n\u2022 The framework we consider assumes that by choosing each action, other than just obtaining\nthat action\u2019s reward, we can also observe some side-information about the rewards of other\nactions. We formalize this as a graph Gt over the actions, where an edge between two\nactions means that by choosing one action, we can also get a \u201csufficiently good\u201d estimate\nof the reward of the other action. 
We consider both the case where Gt changes at each\nround t, as well as the case that Gt = G is fixed throughout all rounds.\n\n\u2022 We establish upper and lower bounds on the achievable regret, which depends on two\ncombinatorial properties of Gt: Its independence number \u03b1(Gt) (namely, the largest number\nof nodes without edges between them), and its clique-partition number \u00af\u03c7(Gt) (namely, the\nsmallest number of cliques into which the nodes can be partitioned).\n\n\u2022 We present two practical algorithms to deal with this setting. The first algorithm, called\nExpBan, combines existing algorithms in a natural way, and applies only when Gt = G\nis fixed at all T rounds. Ignoring computational constraints, the algorithm achieves a\nregret bound of $O(\\sqrt{\\bar{\\chi}(G)\\log(k)T})$. With computational constraints, its regret bound is\n$O(\\sqrt{c\\log(k)T})$, where c is the size of the minimal clique partition one can efficiently find\nfor G. However, note that for general graphs, it is NP-hard to find a clique partition for\nwhich $c = O(k^{1-\\epsilon})$ for any $\\epsilon > 0$.\n\u2022 The second algorithm, called ELP, is an improved algorithm, which can handle graphs\nwhich change between rounds. For undirected graphs, where sampling i gives an\nobservation on j and vice versa, it achieves a regret bound of $O(\\sqrt{\\log(k)\\sum_{t=1}^{T}\\alpha(G_t)})$. For\ndirected graphs (where the observation structure is not symmetric), our regret bound is\nat most $O(\\sqrt{\\log(k)\\sum_{t=1}^{T}\\bar{\\chi}(G_t)})$. 
Moreover, the algorithm is computationally efficient.\nThis is in contrast to the ExpBan algorithm, which in the worst case, cannot efficiently\nachieve regret significantly better than $O(\\sqrt{k\\log(k)T})$.\n\n\u2022 For the case of a fixed graph Gt = G, we present an information-theoretic $\\Omega(\\sqrt{\\alpha(G)T})$\nlower bound on the regret, which holds regardless of computational efficiency.\n\n\u2022 We present some simple synthetic experiments, which demonstrate that the potential\nadvantage of the ELP algorithm over other approaches is real, and not just an artifact of our\nanalysis.\n\n1.1 Related Work\n\nThe standard multi-armed bandits problem assumes no relationship between the actions. Quite a few\npapers studied alternative models, where the actions are endowed with a richer structure. However,\nin the large majority of such papers, the feedback structure is the same as in the standard multi-armed\nbandits. Examples include [11], where the actions\u2019 rewards are assumed to be drawn from a\nstatistical distribution, with correlations between the actions; and [1, 8], where the actions\u2019 rewards are\nassumed to satisfy some Lipschitz continuity property with respect to a distance measure between\nthe actions.\nIn terms of other approaches, the combinatorial bandits framework [7] considers a setting slightly\nsimilar to ours, in that one chooses and observes the rewards of some subset of actions. However, it\nis crucially assumed that the reward obtained is the sum of the rewards of all actions in the subset.\nIn other words, there is no separation between earning a reward and obtaining information on its\nvalue. Another relevant approach is partial monitoring, which is a very general framework for\nonline learning under partial feedback. 
However, this generality comes at the price of tractability for\nall but specific cases, which do not include our model.\nOur work is also somewhat related to the contextual bandit problem (e.g., [9, 10]), where the\nstandard multi-armed bandits setting is augmented with some side-information provided in each round,\nwhich can be used to determine which action to pick. While we also consider additional\nside-information, it is in a more specific sense. Moreover, our goal is still to compete against the best\nsingle action, rather than some set of policies which use this side-information.\n\n2 Problem Setting\n\nLet [k] = {1, . . . , k} and [T ] = {1, . . . , T}. We consider a set of actions 1, 2, . . . , k. Choosing\nan action i at round t results in receiving a reward gi(t), which we shall assume without loss of\ngenerality to be bounded in [0, 1]. Following the standard adversarial framework, we make no\nassumptions whatsoever about how the rewards are selected, and they might even be chosen by an\nadversary. We denote our choice of action at round t as it. Our goal is to minimize regret with\nrespect to the best single action in hindsight, namely\n\n$$\\max_{i} \\sum_{t=1}^{T} g_i(t) - \\sum_{t=1}^{T} g_{i_t}(t).$$\n\nAlgorithm 1 The ExpBan Algorithm\nInput: neighborhood sets {Ni(t)}i\u2208[k].\nSplit the graph induced by the neighborhood sets into c cliques (c \u2264 k as small as possible)\nFor each clique, define a \u201cmeta-action\u201d to be a standard experts algorithm over the actions in the\nclique\nRun a multi-armed-bandits algorithm over the c meta-actions\n\nFor simplicity, we will focus on a finite-horizon setting (where the number of rounds T is known in\nadvance), on regret bounds which hold in expectation, and on oblivious adversaries, namely that the\nreward sequence gi(t) is unknown but fixed in advance (see Sec. 8 for more on this issue).\nEach round t, the learning algorithm chooses a single action it. 
In the standard multi-armed bandits\nsetting, this results in git(t) being revealed to the algorithm, while gj(t) remains unknown for any\nj \u2260 it. In our setting, we assume that by choosing an action i, other than getting gi(t), we also\nget some side-observations about the rewards of the other actions. Formally, we assume that one\nreceives gi(t), and for some fixed parameter b is able to construct unbiased estimates \u02c6gj(t) for all\nactions j in some subset of [k], such that E[\u02c6gj(t)|action i chosen] = gj(t) and Pr(|\u02c6gj(t)| \u2264 b) = 1.\nFor any action j, we let Nj(t) be the set of actions, for which we can get such an estimate \u02c6gj(t) on the\nreward of action j. This is essentially the \u201cneighborhood\u201d of action j, which receives sufficiently\ngood information (as parameterized by b) on the reward of action j. We note that j is always a\nmember of Nj, and moreover, Nj may be larger or smaller depending on the value of b we choose.\nWe assume that Nj(t) for all j, t are known to the learner in advance.\nIntuitively, one can think of this setting as a sequence of graphs, one graph per round t, which\ncaptures the information feedback structure between the actions. Formally, we define Gt to be a\ngraph on the k nodes 1, . . . , k, with an edge from node i to node j if and only if j \u2208 Ni(t). In the\ncase that j \u2208 Ni(t) if and only if i \u2208 Nj(t), for all i, j, we say that Gt is undirected. We will use\nthis graph viewpoint extensively in the remainder of the paper.\n\n3 The ExpBan Algorithm\n\nWe begin by presenting the ExpBan algorithm (see Algorithm 1 above), which builds on existing\nalgorithms to deal with our setting, in the special case where the graph structure remains fixed\nthroughout the rounds - namely, Gt = G for all t. 
The idea of the algorithm is to split the actions\ninto c cliques, such that choosing an action in a clique reveals unbiased estimates of the rewards of\nall the other actions in the clique. By running a standard experts algorithm (such as the\nexponentially weighted forecaster - see [6, Chapter 2]), we can get low regret with respect to any action in\nthat clique. We then treat each such expert algorithm as a meta-action, and run a standard bandits\nalgorithm (such as the EXP3 [4]) over these c meta-actions. We denote this algorithm as ExpBan,\nsince it combines an experts algorithm with a bandit algorithm.\nThe following result provides a bound on the expected regret of the algorithm. The proof appears in\nthe appendix.\nTheorem 1. Suppose Gt = G is fixed for all T rounds. If we run ExpBan using the exponentially\nweighted forecaster and the EXP3 algorithm, then the expected regret is bounded as follows:2\n\n$$\\sum_{t=1}^{T} g_j(t) - \\mathbb{E}\\left[\\sum_{t=1}^{T} g_{i_t}(t)\\right] \\le 4b\\sqrt{c\\log(k)T}. \\tag{1}$$\n\nFor the optimal clique partition, we have c = \u00af\u03c7(G), the clique-partition number of G.\n\n2Using more sophisticated methods, it is now known that the log(k) factor can be removed (e.g., [3]).\nHowever, we will stick with this slightly less tight analysis for simplicity.\n\nIt is easily seen that \u00af\u03c7(G) is a number between 1 and k. The case \u00af\u03c7(G) = 1 corresponds to G being\na clique, namely, that choosing any action allows us to estimate the rewards of all other actions.\nThis corresponds to the standard experts setting, in which case the algorithm attains the optimal\n$O(\\sqrt{\\log(k)T})$ regret. At the other extreme, \u00af\u03c7(G) = k corresponds to G being the empty graph,\nnamely, that choosing any action only reveals the reward of that action. This corresponds to the\nstandard bandit setting, in which case the algorithm attains the standard $O(\\sqrt{\\log(k)kT})$ regret. 
For\ngeneral graphs, our algorithm interpolates between these regimes, in a way which depends on \u00af\u03c7(G).\nWhile being simple and using off-the-shelf components, the ExpBan algorithm has some\ndisadvantages. First of all, for a general graph G, it is NP-hard to find $c \\le O(k^{1-\\epsilon})$ for any $\\epsilon > 0$. (This\nfollows from [12] and the fact that the clique-partition number of G equals the chromatic number\nof its complement.) Thus, with computational constraints, one cannot hope to obtain a bound better\nthan $\\tilde{O}(\\sqrt{kT})$. That being said, we note that this is only a worst-case result, and in practice or for\nspecific classes of graphs, computing a good clique partition might be relatively easy. A second\ndisadvantage of the algorithm is that it is not applicable for an observation structure that changes\nwith time.\n\n4 The ELP Algorithm\n\nWe now turn to present the ELP algorithm (which stands for \u201cExponentially-weighted algorithm\nwith Linear Programming\u201d). Like all multi-armed bandits algorithms, it is based on a tradeoff\nbetween exploration and exploitation. However, unlike standard algorithms, the exploration\ncomponent is not uniform over the actions, but is chosen carefully to reflect the graph structure at each\nround. In fact, the optimal choice of the exploration requires us to solve a simple linear program,\nhence the name of the algorithm. Below, we present the pseudo-code as well as a couple of theorems\nthat bound the expected regret of the algorithm under appropriate parameter choices. The proofs of\nthe theorems appear in the appendix. The first theorem concerns the symmetric observation case,\nwhere if choosing action i gives information on action j, then choosing action j must also give\ninformation on i. The second theorem concerns the general case. 
We note that in both cases the graph\nGt may change arbitrarily in time.\n\nAlgorithm 2 The ELP Algorithm\nInput: $\\beta$, $\\{\\gamma(t)\\}_{t\\in[T]}$, $\\{s_i(t)\\}_{i\\in[k],t\\in[T]}$, neighborhood sets $\\{N_i(t)\\}_{i\\in[k],t\\in[T]}$.\n$\\forall j \\in [k]$: $w_j(1) := 1/k$.\nfor t = 1, . . . , T do\n  $\\forall i \\in [k]$: $p_i(t) := (1-\\gamma(t))\\frac{w_i(t)}{\\sum_{l=1}^{k} w_l(t)} + \\gamma(t)s_i(t)$\n  Choose action $i_t$ with probability $p_{i_t}(t)$, and receive reward $g_{i_t}(t)$\n  Compute $\\hat{g}_j(t)$ for all $j \\in N_{i_t}(t)$\n  For all $j \\in [k]$, let $\\tilde{g}_j(t) = \\frac{\\hat{g}_j(t)}{\\sum_{l\\in N_j(t)} p_l(t)}$ if $i_t \\in N_j(t)$, and $\\tilde{g}_j(t) = 0$ otherwise.\n  $\\forall j \\in [k]$: $w_j(t+1) = w_j(t)\\exp(\\beta\\tilde{g}_j(t))$\nend for\n\n4.1 Undirected Graphs\n\nThe following theorem provides a regret bound for the algorithm, as well as appropriate parameter\nchoices, in the case of undirected graphs. Later on, we will discuss the case of directed graphs. In\na nutshell, the theorem shows that the regret bound depends on the average independence number\n\u03b1(Gt) of each graph Gt - namely, the size of its largest independent set.\nTheorem 2. Suppose that for all t, Gt is an undirected graph. Suppose we run Algorithm 2 using\nsome $\\beta \\in (0, 1/2bk)$, and choosing\n\n$$\\{s_i(t)\\}_{i\\in[k]} = \\operatorname*{argmax}_{\\forall i\\ s_i(t)\\ge 0,\\ \\sum_i s_i(t)=1}\\ \\min_{j\\in[k]} \\sum_{l\\in N_j(t)} s_l(t)$$\n\n(which can be easily done via linear programming) and $\\gamma(t) = \\beta b / \\min_{j\\in[k]} \\sum_{l\\in N_j(t)} s_l(t)$. Then\nit holds for any fixed action j that\n\n$$\\sum_{t=1}^{T} g_j(t) - \\mathbb{E}\\left[\\sum_{t=1}^{T} g_{i_t}(t)\\right] \\le 3\\beta b^2 \\sum_{t=1}^{T}\\alpha(G_t) + \\frac{\\log(k)}{\\beta}. \\tag{2}$$\n\nIf we choose $\\beta = \\sqrt{\\log(k)/(3b^2\\sum_t \\alpha(G_t))}$, then the bound equals\n\n$$b\\sqrt{3\\log(k)\\sum_{t=1}^{T}\\alpha(G_t)}. \\tag{3}$$\n\nComparing Thm. 2 with Thm. 
1, we note that for any graph Gt, its independence number \u03b1(Gt)\nlower bounds its clique-partition number \u00af\u03c7(Gt). In fact, the gap between them can be very large\n(see Sec. 6). Thus, the attainable regret using the ELP algorithm is better than the one attained by the\nExpBan algorithm. Moreover, the ELP algorithm is able to deal with time-changing graphs, unlike\nthe ExpBan algorithm.\nIf we take worst-case computational efficiency into account, things are slightly more involved.\nFor the ELP algorithm, the optimal value of \u03b2, needed to obtain Eq. (3), requires knowledge of\n$\\sum_{t=1}^{T}\\alpha(G_t)$, but computing or approximating the \u03b1(Gt) is NP-hard in the worst case. However,\nthere is a simple fix: we create $\\lceil\\log(k)\\rceil$ copies of the ELP algorithm, where copy i assumes that\n$\\sum_{t=1}^{T}\\alpha(G_t)$ equals $2^{i-1}$. Note that one of these values must be wrong by a factor of at most 2,\nso the regret of the algorithm using that value would be larger by a factor of at most 2. Of course,\nthe problem is that we don\u2019t know in advance which of those $\\lceil\\log(k)\\rceil$ copies is the best one. But\nthis can be easily solved by treating each such copy as a \u201cmeta-action\u201d, and running a standard\nmulti-armed bandits algorithm (such as EXP3) over these $\\lceil\\log(k)\\rceil$ actions. Note that the same idea\nwas used in the construction of the ExpBan algorithm. Since there are $\\lceil\\log(k)\\rceil$ meta-actions, the\nadditional regret incurred is $O(\\sqrt{\\log^2(k)T})$. So up to logarithmic factors in k, we get the same\nregret as if we could actually compute the optimal value of \u03b2.\n\n4.2 Directed Graphs\n\nSo far, we assumed that the graphs we are dealing with are all undirected. However, a natural\nextension of this setting is to assume a directed graph, where choosing an action i may give us\ninformation on the reward of action j, but not vice-versa. 
It is readily seen that the ExpBan algorithm\nwould still work in this setting, with the same guarantee. For the ELP algorithm, we can provide the\nfollowing guarantee:\nTheorem 3. Under the conditions of Thm. 2 (with the relaxation that the graphs Gt may be\ndirected), it holds for any fixed action j that\n\n$$\\sum_{t=1}^{T} g_j(t) - \\mathbb{E}\\left[\\sum_{t=1}^{T} g_{i_t}(t)\\right] \\le 3\\beta b^2 \\sum_{t=1}^{T}\\bar{\\chi}(G_t) + \\frac{\\log(k)}{\\beta}, \\tag{4}$$\n\nwhere \u00af\u03c7(Gt) is the clique-partition number of Gt. If we choose $\\beta = \\sqrt{\\log(k)/(3b^2\\sum_t \\bar{\\chi}(G_t))}$, then\nthe bound equals\n\n$$b\\sqrt{3\\log(k)\\sum_{t=1}^{T}\\bar{\\chi}(G_t)}. \\tag{5}$$\n\nNote that this bound is weaker than the one of Thm. 2, since \u03b1(Gt) \u2264 \u00af\u03c7(Gt) as discussed earlier. We\ndo not know whether this bound (relying on the clique-partition number) is tight, but we conjecture\nthat the independence number, which appears to be the key quantity in undirected graphs, is not the\ncorrect combinatorial measure for the case of directed graphs3. In any case, we note that even with\nthe weaker bound above, the ELP algorithm still seems superior to the ExpBan algorithm, in the\nsense that it allows us to deal with time-changing graphs, and that an explicit clique decomposition\nof the graph is not required. Also, we again have the issue of \u03b2 which is determined by a quantity\nwhich is NP-hard to compute, i.e. \u00af\u03c7(Gt). However, this can be circumvented using the same trick\ndiscussed in the context of undirected graphs.\n\n3It is possible to construct examples where the analysis of the ELP algorithm necessarily leads to an\n$O(\\sqrt{k\\log(k)T})$ bound, even when the independence number is 1\n\n5 Lower Bound\n\nThe following theorem provides a lower bound on the regret in terms of the independence number\n\u03b1(G), for a constant graph Gt = G.\nTheorem 4. 
Suppose Gt = G for all t, and that actions which are not linked in G get no\nside-observations whatsoever between them. Then there exists a (randomized) adversary\nstrategy, such that for every $T \\ge 374\\,\\alpha(G)^3$ and any learning strategy, the expected regret is at least\n$0.06\\sqrt{\\alpha(G)T}$.\n\nA proof is provided in the appendix. The intuition of the proof is that if the graph G has \u03b1(G)\nindependent vertices, then an adversary can make this problem as hard as a standard multi-armed\nbandits problem, played on \u03b1(G) actions. Using a known lower bound of $\\Omega(\\sqrt{nT})$ for multi-armed\nbandits on n actions, our result follows4.\nFor constant undirected graphs, this lower bound matches the regret upper bound for the ELP\nalgorithm (Thm. 2) up to logarithmic factors. For directed graphs, the difference between them boils\ndown to the difference between \u00af\u03c7(G) and \u03b1(G). For many well-behaved graphs, this gap is rather\nsmall. However, for general graphs, the difference can be huge - see the next section for details.\n\n6 Examples\n\nHere, we briefly discuss some concrete examples of graphs G, and show how the regret performance\nof our algorithms depends on their structure. An interesting issue to notice is the potential gap\nbetween the performance of our algorithms, through the graph\u2019s independence number \u03b1(G) and\nclique-partition number \u00af\u03c7(G).\nFirst, consider the case where there exists a single action, such that choosing it reveals the rewards\nof all the other actions. In contrast, choosing any of the other actions only reveals its own reward. At first\nblush, it may seem that having such a \u201csuper-action\u201d, which reveals everything that happens in the\ncurrent round, should help us improve our regret. However, the independence number \u03b1(G) of such\na graph is easily seen to be k \u2212 1. 
Based on our lower bound, we see that this \u201csuper-action\u201d is\nactually not helpful at all (up to negligible factors).\nSecond, consider the case where the actions are endowed with some metric distance function, and\nedge (i, j) is in G if and only if the distance between i, j is at most some fixed constant r. We can\nthink of each action i as being in the center of a sphere of radius r, such that the reward of action i is\npropagated to every other action in that sphere. In this case, \u03b1(G) is essentially the number of\nnon-overlapping spheres we can pack in G. In contrast, \u00af\u03c7(G) is essentially the number of spheres we\nneed to cover G. Both numbers shrink rapidly as r increases, improving the regret of our algorithms.\nHowever, the sphere covering size can be much larger than the sphere packing size. For example, if\nthe actions are placed as the elements in $\\{0, 1/2, 1\\}^n$, we use the $l_\\infty$ metric, and r \u2208 (1/2, 1), it is\neasily seen that the sphere packing number is just 1. In contrast, the sphere covering number is at\nleast $2^n = k^{\\log_3(2)} \\approx k^{0.63}$, since we need a separate sphere to cover every element in $\\{0, 1\\}^n$.\nThird, consider the random Erd\u0151s-R\u00e9nyi graph G = G(k, p), which is formed by linking every\naction i to every action j with probability p independently. It is well known that when p is a\nconstant, the independence number \u03b1(G) of this graph is only O(log(k)), whereas the clique-partition\nnumber \u00af\u03c7(G) is at least \u2126(k/ log(k)). This translates to a regret bound of $O(\\sqrt{kT})$ for the\nExpBan algorithm, and only $O(\\sqrt{\\log^2(k)T})$ for the ELP algorithm. 
Such a gap would also hold for a\ndirected random graph.\n\n7 Empirical Performance Gap between ExpBan and ELP\n\nIn this section, we show that the gap between the performance of the ExpBan algorithm and the ELP\nalgorithm can be real, and is not just an artifact of our analysis.\n\n4We note that if the maximal degree of every node is bounded by d, it is possible to get the lower bound for\n$T \\ge \\Omega(d^2\\alpha(G))$ (as opposed to $T \\ge \\Omega(\\alpha(G)^3)$); see the proof for details.\n\nFigure 1: Experiments on random graphs.\n\nTo show this, we performed the following simple experiment: we created a random Erd\u0151s-R\u00e9nyi\ngraph over 300 nodes, where each pair of nodes was linked independently with probability p.\nChoosing any action results in observing the rewards of neighboring actions in the graph. The reward\nof each action at each round was chosen randomly and independently to be 1 with probability 1/2\nand 0 with probability 1/2, except for a single node, whose reward equals 1 with a higher probability\nof 3/4. We then implemented the ExpBan and ELP algorithms in this setting, for T = 30,000. For\ncomparison, we also implemented the standard EXP3 multi-armed bandits algorithm [4], which\ndoesn\u2019t use any side-observations. All the parameters were set to their theoretically optimal values.\nThe experiment was repeated for varying p and over 10 independent runs.\nThe results are displayed in Figure 1. The X-axis is the iteration number, and the Y-axis is the\nmean payoff obtained so far, averaged over the 10 runs (the variance in the numbers was minuscule,\nand therefore we do not report confidence intervals). For p = 0.05, the graph is rather empty, and\nthe advantage of using side observations is not large. As a result, all 3 algorithms perform roughly\nthe same for this choice of T. 
As p increases, the value of side-observations increases, and the\nperformance of our two algorithms, which utilize side-observations, improves over the standard\nmulti-armed bandits algorithm. Moreover, for intermediate values of p, there is a noticeable gap\nbetween the performance of ExpBan and ELP. This is exactly the regime where the gap between\nthe clique-partition number (governing the regret bound of ExpBan) and the independence number\n(governing the regret bound for the ELP algorithm) tends to be larger as well5. Finally, for large p,\nthe graph is almost complete, and the advantage of ELP over ExpBan becomes small again (since\nmost actions give information on most other actions).\n\n8 Discussion\n\nIn this paper, we initiated a study of a large family of online learning problems with side\nobservations. In particular, we studied the broad regime which interpolates between the experts setting and\nthe bandits setting of online learning. We provided algorithms, as well as upper and lower bounds\non the attainable regret, with a non-trivial dependence on the information feedback structure.\nThere are many open questions that warrant further study. First, the upper and lower bounds\nessentially match only in particular settings (i.e., in undirected graphs, where no side-observations\nwhatsoever, other than those dictated by the graph, are allowed). Can this gap be narrowed or closed?\nSecond, our lower bounds depend on a reduction which essentially assumes that the graph is\nconstant over time. We do not have a lower bound for changing graphs. Third, it remains to be seen\nwhether other online learning results can be generalized to our setting, such as learning with respect\nto policies (as in EXP4 [4]) and obtaining bounds which hold with high probability. Fourth, the\nmodel we have studied assumed that the observation structure is known. 
In many practical cases,\nthe observation structure may be known just partially or approximately. Is it possible to devise\nalgorithms for such cases?\n\nAcknowledgements. This research was supported in part by the Google Inter-university center for\nElectronic Markets and Auctions.\n\n5 Intuitively, this can be seen by considering the extreme cases: for a complete graph over k nodes, both\nnumbers equal 1, and for an empty graph over k nodes, both numbers equal k. For constant p \u2208 (0, 1), there is\na real gap between the two, as discussed in Sec. 6.\n\n[Figure 1 panels: p = 0.05, p = 0.35, p = 0.65, p = 0.95. X-axis: Iteration (\u00d710^4). Y-axis: Average Payoff. Curves: EXP3, ExpBan, ELP.]\n\nReferences\n\n[1] R. Agrawal. The continuum-armed bandit problem. SIAM J. Control and Optimization, 33:1926\u20131951, 1995.\n\n[2] H. Arslan, Z. N. Chen, and M. G. Di Benedetto. Ultra Wideband Wireless Communication. Wiley-Interscience, 2006.\n\n[3] J.-Y. Audibert and S. Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, 2009.\n\n[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48\u201377, 2002.\n\n[5] V. Baston. Some cyclic inequalities. Proceedings of the Edinburgh Mathematical Society (Series 2), 19:115\u2013118, 1974.\n\n[6] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n\n[7] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. In COLT, 2009.\n\n[8] R. Kleinberg, A. Slivkins, and E. Upfal. Multi-armed bandits in metric spaces. In STOC, pages 681\u2013690, 2008.\n\n[9] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.\n\n[10] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. 
In Proceedings of the 19th International Conference on World Wide Web, pages 661\u2013670. ACM, 2010.\n\n[11] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):395\u2013411, 2010.\n\n[12] D. Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(1):103\u2013128, 2007.", "award": [], "sourceid": 483, "authors": [{"given_name": "Shie", "family_name": "Mannor", "institution": null}, {"given_name": "Ohad", "family_name": "Shamir", "institution": null}]}