{"title": "A Gang of Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 745, "abstract": "Multi-armed bandit problems are receiving a great deal of attention because they adequately formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, we may want to serve content to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global strategy which allocates a bandit algorithm to each network node (user) and allows it to \u201cshare\u201d signals (contexts and payoffs) with the neghboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. 
Our experiments, carried out on synthetic and real-world datasets, show a marked increase in prediction performance obtained by exploiting the network structure.", "full_text": "A Gang of Bandits\n\nNicolò Cesa-Bianchi\nUniversità degli Studi di Milano, Italy\nnicolo.cesa-bianchi@unimi.it\n\nClaudio Gentile\nUniversity of Insubria, Italy\nclaudio.gentile@uninsubria.it\n\nGiovanni Zappella\nUniversità degli Studi di Milano, Italy\ngiovanni.zappella@unimi.it\n\nAbstract\n\nMulti-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to \u201cshare\u201d signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure.\n\n1 Introduction\n\nThe ability of a website to present personalized content recommendations is playing an increasingly crucial role in achieving user satisfaction. 
Because of the appearance of new content, and due to the ever-changing nature of content popularity, modern approaches to content recommendation are strongly adaptive, and attempt to match users\u2019 interests as closely as possible by learning good mappings between available content and users. These mappings are based on \u201ccontexts\u201d, that is, sets of features that, typically, are extracted from both contents and users. The need to focus on content that raises the user\u2019s interest and, simultaneously, the need to explore new content in order to globally improve the user experience create an exploration-exploitation dilemma, which is commonly formalized as a multi-armed bandit problem. Indeed, contextual bandits have become a reference model for the study of adaptive techniques in recommender systems (e.g., [5, 7, 15]). In many cases, however, the users targeted by a recommender system form a social network. The network structure provides an important additional source of information, revealing potential affinities between pairs of users. The exploitation of such affinities could lead to a dramatic increase in the quality of the recommendations. This is because the knowledge gathered about the interests of a given user may be exploited to improve the recommendations to the user\u2019s friends. In this work, we propose an algorithmic approach to networked contextual bandits which is provably able to leverage user similarities represented as a graph. Our approach consists in running an instance of a contextual bandit algorithm at each network node. These instances are allowed to interact during the learning process, sharing contexts and user feedback. Under the modeling assumption that user similarities are properly reflected by the network structure, these interactions allow us to effectively speed up the learning process that takes place at each node. 
This mechanism is implemented by running instances of a linear contextual bandit algorithm in a specific reproducing kernel Hilbert space (RKHS). The underlying kernel, previously used for solving online multitask classification problems (e.g., [8]), is defined in terms of the Laplacian matrix of the graph. The Laplacian matrix provides the information we rely upon to share user feedback from one node to the others, according to the network structure. Since the Laplacian kernel is linear, the implementation in kernel space is straightforward. Moreover, the existing performance guarantees for the specific bandit algorithm we use can be directly lifted to the RKHS, and expressed in terms of spectral properties of the user network. Despite its crispness, the principled approach described above has two drawbacks hindering its practical usage. First, running a network of linear contextual bandit algorithms with a Laplacian-based feedback-sharing mechanism may cause significant scaling problems, even on small- to medium-sized social networks. Second, the social information provided by the network structure at hand need not be fully reliable in accounting for user behavior similarities. Clearly enough, the more such algorithms hinge on the network to improve learning rates, the more they are penalized if the network information is noisy and/or misleading. After collecting empirical evidence on the sensitivity of networked bandit methods to graph noise, we propose two simple modifications to our basic strategy, both aimed at circumventing the above issues by clustering the graph nodes. The first approach reduces graph noise simply by deleting edges between pairs of clusters. By doing that, we end up running a scaled-down independent instance of our original strategy on each cluster. The second approach treats each cluster as a single node of a much smaller cluster network. 
In both cases, we are able to empirically improve prediction performance, and simultaneously achieve dramatic savings in running times. We run experiments on two real-world datasets: one extracted from the social bookmarking web service Delicious, and the other from the music streaming platform Last.fm.\n\n2 Related work\n\nThe benefit of using social relationships in order to improve the quality of recommendations is a recognized fact in the literature of content recommender systems; see, e.g., [5, 13, 18] and the survey [3]. Linear models for contextual bandits were introduced in [4]. Their application to personalized content recommendation was pioneered in [15], where the LinUCB algorithm was introduced. An analysis of LinUCB was provided in the subsequent work [9]. To the best of our knowledge, this is the first work that combines contextual bandits with the social graph information. However, non-contextual stochastic bandits in social networks were studied in a recent independent work [20]. Other works, such as [2, 19], consider contextual bandits assuming metric or probabilistic dependencies on the product space of contexts and actions. A different viewpoint, where each action reveals information about other actions\u2019 payoffs, is the one studied in [7, 16], though without the context provided by feature vectors. A non-contextual model of bandit algorithms running on the nodes of a graph was studied in [14]. In that work, only one node reveals its payoffs, and the statistical information acquired by this node over time is spread across the entire network following the graphical structure. The main result shows that the information flow rate is sufficient to control regret at each node of the network. 
More recently, a new model of distributed non-contextual bandit algorithms has been presented in [21], where the number of communications among the nodes is limited, and all the nodes in the network have the same best action.\n\n3 Learning model\n\nWe assume the social relationships over users are encoded as a known undirected and connected graph G = (V, E), where V = {1, . . . , n} represents a set of n users, and the edges in E represent the social links over pairs of users. Recall that a graph G can be equivalently defined in terms of its Laplacian matrix $L = [L_{i,j}]_{i,j=1}^n$, where $L_{i,i}$ is the degree of node i (i.e., the number of edges incident to node i) and, for $i \neq j$, $L_{i,j}$ equals $-1$ if $(i,j) \in E$, and 0 otherwise. Learning proceeds in a sequential fashion: at each time step t = 1, 2, . . . , the learner receives a user index $i_t \in V$ together with a set of context vectors $C_{i_t} = \{x_{t,1}, x_{t,2}, \dots, x_{t,c_t}\} \subseteq \mathbb{R}^d$. The learner then selects some $\bar{x}_t = x_{t,k_t} \in C_{i_t}$ to recommend to user $i_t$ and observes some payoff $a_t \in [-1, 1]$, a function of $i_t$ and $\bar{x}_t$. No assumptions whatsoever are made on the way index $i_t$ and set $C_{i_t}$ are generated, in that they can arbitrarily depend on past choices made by the algorithm.1\n\nA standard modeling assumption for bandit problems with contextual information (one that is also adopted here) is to assume that rewards are generated by noisy versions of unknown linear functions of the context vectors. That is, we assume each node $i \in V$ hosts an unknown parameter vector $u_i \in \mathbb{R}^d$, and that the reward value $a_i(x)$ associated with node i and context vector $x \in \mathbb{R}^d$ is given by the random variable $a_i(x) = u_i^\top x + \epsilon_i(x)$, where $\epsilon_i(x)$ is a conditionally zero-mean and bounded-variance noise term. 
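For concreteness, the Laplacian just defined can be assembled directly from an edge list. The sketch below is our own illustration (not the paper's code), using a hypothetical 4-node cycle graph:

```python
# Build the graph Laplacian L described above: L[i][i] is the degree of
# node i, L[i][j] = -1 when (i, j) is an edge, and 0 otherwise.
import numpy as np

def laplacian(n, edges):
    """Laplacian of an undirected graph on nodes 0..n-1."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] = L[j, i] = -1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
L = laplacian(4, edges)
print(L[0, 0])   # degree of node 0 -> 2.0
```

A useful sanity check is that every row of a Laplacian sums to zero, since each degree is matched by as many off-diagonal $-1$ entries.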
Specifically, denoting by $\mathbb{E}_t[\cdot]$ the conditional expectation $\mathbb{E}[\,\cdot \mid (i_1, C_{i_1}, a_1), \dots, (i_{t-1}, C_{i_{t-1}}, a_{t-1})]$, we take the general approach of [1], and assume that for any fixed $i \in V$ and $x \in \mathbb{R}^d$, the variable $\epsilon_i(x)$ is conditionally sub-Gaussian with variance parameter $\sigma^2 > 0$, namely, $\mathbb{E}_t[\exp(\gamma\,\epsilon_i(x))] \le \exp(\sigma^2 \gamma^2 / 2)$ for all $\gamma \in \mathbb{R}$ and all $x, i$. This implies $\mathbb{E}_t[\epsilon_i(x)] = 0$ and $\mathbb{V}_t[\epsilon_i(x)] \le \sigma^2$, where $\mathbb{V}_t[\cdot]$ is a shorthand for the conditional variance $\mathbb{V}[\,\cdot \mid (i_1, C_{i_1}, a_1), \dots, (i_{t-1}, C_{i_{t-1}}, a_{t-1})]$. So we clearly have $\mathbb{E}_t[a_i(x)] = u_i^\top x$ and $\mathbb{V}_t[a_i(x)] \le \sigma^2$. Therefore, $u_i^\top x$ is the expected reward observed at node i for context vector x. In the special case when the noise $\epsilon_i(x)$ is a bounded random variable taking values in the range $[-1, 1]$, this implies $\mathbb{V}_t[a_i(x)] \le 4$.\n\nThe regret $r_t$ of the learner at time t is the amount by which the average reward of the best choice in hindsight at node $i_t$ exceeds the average reward of the algorithm\u2019s choice, i.e.,\n\n$$r_t = \Big( \max_{x \in C_{i_t}} u_{i_t}^\top x \Big) - u_{i_t}^\top \bar{x}_t\,.$$\n\nThe goal of the algorithm is to bound with high probability (over the noise variables $\epsilon_{i_t}$) the cumulative regret $\sum_{t=1}^T r_t$ for the given sequence of nodes $i_1, \dots, i_T$ and observed context vector sets $C_{i_1}, \dots, C_{i_T}$. We model the similarity among users in V by making the assumption that nearby users hold similar underlying vectors $u_i$, so that reward signals received at a given node $i_t$ at time t are also, to some extent, informative for learning the behavior of other users j connected to $i_t$ within G. 
We make this more precise by taking the perspective of known multitask learning settings (e.g., [8]), and assume that\n\n$$\sum_{(i,j) \in E} \|u_i - u_j\|^2 \qquad (1)$$\n\nis small compared to $\sum_{i \in V} \|u_i\|^2$, where $\|\cdot\|$ denotes the standard Euclidean norm of vectors. That is, although (1) may possibly contain a quadratic number of terms, the closeness of vectors lying on adjacent nodes in G makes this sum comparatively smaller than the total squared length of such vectors. This will be our working assumption throughout, one that motivates the Laplacian-regularized algorithm presented in Section 4, and empirically tested in Section 5.\n\n4 Algorithm and regret analysis\n\nOur bandit algorithm maintains at time t an estimate $w_{i,t}$ for the vector $u_i$. Vectors $w_{i,t}$ are updated based on the reward signals as in a standard linear bandit algorithm (e.g., [9]) operating on the context vectors contained in $C_{i_t}$. Every node i of G hosts a linear bandit algorithm like the one described in Figure 1. The algorithm in Figure 1 maintains at time t a prototype vector $w_t$ which is the result of a standard linear least-squares approximation to the unknown parameter vector u associated with the node under consideration. In particular, $w_{t-1}$ is obtained by multiplying the inverse correlation matrix $M_{t-1}^{-1}$ and the bias vector $b_{t-1}$. At each time t = 1, 2, . . . , the algorithm receives context vectors $x_{t,1}, \dots, x_{t,c_t}$ contained in $C_t$, and must select one among them. The linear bandit algorithm selects $\bar{x}_t = x_{t,k_t}$ as the vector in $C_t$ that maximizes an upper-confidence-corrected estimate of the expected reward achieved over context vectors $x_{t,k}$. The estimate is based on the current $w_{t-1}$, while the upper confidence level $CB_t$ is suggested by the standard analysis of linear bandit algorithms; see, e.g., [1, 9, 10]. 
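Numerically, the quantity in (1) is exactly the Laplacian quadratic form: stacking the $u_i$ as the rows of a matrix U, the sum over edges of $\|u_i - u_j\|^2$ equals $\mathrm{tr}(U^\top L U)$. A small check of this identity, with made-up vectors and a made-up path graph (our illustration, not the paper's code):

```python
# Verify that sum_{(i,j) in E} ||u_i - u_j||^2 == trace(U' L U),
# where row i of U is u_i and L is the graph Laplacian.
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 3
edges = [(0, 1), (1, 2), (2, 3)]           # a path graph, for illustration
L = np.zeros((n, n))
for i, j in edges:
    L[i, j] = L[j, i] = -1.0
    L[i, i] += 1.0
    L[j, j] += 1.0

U = rng.normal(size=(n, d))                # one (random) u_i per row
direct = sum(np.sum((U[i] - U[j]) ** 2) for i, j in edges)
quad_form = np.trace(U.T @ L @ U)
print(np.isclose(direct, quad_form))       # -> True
```

This identity is what lets the Laplacian act as the regularizer behind the algorithm of Section 4.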
Once the actual reward $a_t$ associated with $\bar{x}_t$ is observed, the algorithm uses $\bar{x}_t$ for updating $M_{t-1}$ to $M_t$ via a rank-one adjustment, and $b_{t-1}$ to $b_t$ via an additive update whose learning rate is precisely $a_t$. This algorithm can be seen as a version of LinUCB [9], a linear bandit algorithm derived from LinRel [4].\n\n1 Formally, $i_t$ and $C_{i_t}$ can be arbitrary (measurable) functions of past rewards $a_1, \dots, a_{t-1}$, indices $i_1, \dots, i_{t-1}$, and sets $C_{i_1}, \dots, C_{i_{t-1}}$.\n\nInit: $b_0 = 0 \in \mathbb{R}^d$ and $M_0 = I \in \mathbb{R}^{d \times d}$;\nfor $t = 1, 2, \dots, T$ do\n  Set $w_{t-1} = M_{t-1}^{-1} b_{t-1}$;\n  Get context $C_t = \{x_{t,1}, \dots, x_{t,c_t}\}$;\n  Set $k_t = \mathrm{argmax}_{k=1,\dots,c_t} \big( w_{t-1}^\top x_{t,k} + CB_t(x_{t,k}) \big)$, where\n    $CB_t(x_{t,k}) = \sqrt{x_{t,k}^\top M_{t-1}^{-1} x_{t,k}} \left( \sigma \sqrt{\ln \frac{|M_t|}{\delta}} + \|u\| \right)$;\n  Set $\bar{x}_t = x_{t,k_t}$;\n  Observe reward $a_t \in [-1, 1]$;\n  Update $M_t = M_{t-1} + \bar{x}_t \bar{x}_t^\top$ and $b_t = b_{t-1} + a_t \bar{x}_t$.\nend for\n\nFigure 1: Pseudocode of the linear bandit algorithm sitting at each node i of the given graph.\n\nInit: $b_0 = 0 \in \mathbb{R}^{dn}$ and $M_0 = I \in \mathbb{R}^{dn \times dn}$;\nfor $t = 1, 2, \dots, T$ do\n  Set $w_{t-1} = M_{t-1}^{-1} b_{t-1}$;\n  Get $i_t \in V$ and context $C_{i_t} = \{x_{t,1}, \dots, x_{t,c_t}\}$;\n  Construct vectors $\phi_{i_t}(x_{t,1}), \dots, \phi_{i_t}(x_{t,c_t})$, and modified vectors $\tilde{\phi}_{t,k} = A_\otimes^{-1/2} \phi_{i_t}(x_{t,k})$, $k = 1, \dots, c_t$;\n  Set $k_t = \mathrm{argmax}_{k=1,\dots,c_t} \big( w_{t-1}^\top \tilde{\phi}_{t,k} + CB_t(\tilde{\phi}_{t,k}) \big)$, where\n    $CB_t(\tilde{\phi}_{t,k}) = \sqrt{\tilde{\phi}_{t,k}^\top M_{t-1}^{-1} \tilde{\phi}_{t,k}} \left( \sigma \sqrt{\ln \frac{|M_t|}{\delta}} + \|\tilde{U}\| \right)$;\n  Observe reward $a_t \in [-1, 1]$ at node $i_t$;\n  Update $M_t = M_{t-1} + \tilde{\phi}_{t,k_t} \tilde{\phi}_{t,k_t}^\top$ and $b_t = b_{t-1} + a_t \tilde{\phi}_{t,k_t}$.\nend for\n\nFigure 2: Pseudocode of the GOB.Lin algorithm.\n\nWe now turn to describing our GOB.Lin (Gang Of Bandits, Linear version) algorithm. GOB.Lin lets the algorithm in Figure 1 operate on each node i of G (we should then add subscript i throughout, replacing $w_t$ by $w_{i,t}$, $M_t$ by $M_{i,t}$, and so forth). The updates $M_{i,t-1} \to M_{i,t}$ and $b_{i,t-1} \to b_{i,t}$ are performed at node i through vector $\bar{x}_t$ both when $i = i_t$ (i.e., when node i is the one which the context vectors in $C_{i_t}$ refer to) and, to a lesser extent, when $i \neq i_t$ (i.e., when node i is not the one which the vectors in $C_{i_t}$ refer to). This is because, as we said, the payoff $a_t$ received for node $i_t$ is somehow informative also for all other nodes $i \neq i_t$. In other words, because we are assuming the underlying parameter vectors $u_i$ are close to each other, we should let the corresponding prototype vectors $w_{i,t}$ undergo similar updates, so as to also keep the $w_{i,t}$ close to each other over time.\n\nWith this in mind, we now describe GOB.Lin in more detail. It is convenient to first introduce some extra matrix notation. Let $A = I_n + L$, where L is the Laplacian matrix associated with G, and $I_n$ is the $n \times n$ identity matrix. 
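The single-node loop of Figure 1 is easy to render in code. The sketch below is ours, not the authors' implementation: it uses the simplified confidence width $\alpha \sqrt{x^\top M^{-1} x}$ (a hypothetical tuning constant $\alpha$ in place of the exact $CB_t$) and synthetic linear payoffs, with all sizes illustrative:

```python
# Minimal sketch of the per-node linear (LinUCB-style) bandit of Figure 1,
# run against a synthetic linear payoff model. Sizes, alpha, and the noise
# level are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
d, T, alpha = 5, 200, 1.0
u = rng.normal(size=d)
u /= np.linalg.norm(u)                    # unknown unit-norm parameter vector
M, b = np.eye(d), np.zeros(d)             # correlation matrix M_0, bias b_0

for t in range(T):
    C = rng.normal(size=(10, d))          # context set C_t
    Minv = np.linalg.inv(M)
    w = Minv @ b                          # least-squares prototype w_{t-1}
    width = np.sqrt(np.einsum('kd,de,ke->k', C, Minv, C))
    x_bar = C[np.argmax(C @ w + alpha * width)]   # upper-confidence choice
    a = x_bar @ u + 0.1 * rng.normal()    # noisy linear reward
    M += np.outer(x_bar, x_bar)           # rank-one update of M_t
    b += a * x_bar                        # additive update of b_t

w_final = np.linalg.inv(M) @ b
print(float(w_final @ u / np.linalg.norm(w_final)))  # alignment with u
```

In practice one would update $M^{-1}$ incrementally (Sherman-Morrison) rather than re-invert; the direct inverse keeps the sketch short.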
Set $A_\otimes = A \otimes I_d$, the Kronecker product2 of matrices A and $I_d$. Moreover, the \u201ccompound\u201d descriptor for the pairing (i, x) is given by the long (and sparse) vector $\phi_i(x) \in \mathbb{R}^{dn}$ defined as\n\n$$\phi_i(x)^\top = \big( \underbrace{0, \dots, 0}_{(i-1)d \text{ times}},\; x^\top,\; \underbrace{0, \dots, 0}_{(n-i)d \text{ times}} \big)\,.$$\n\nWith the above notation handy, a compact description of GOB.Lin is presented in Figure 2, where we deliberately tried to mimic the pseudocode of Figure 1. Notice that in Figure 2 we overloaded the notation for the confidence bound $CB_t$, which is now defined in terms of the Laplacian L of G. In particular, $\|u\|$ in Figure 1 is replaced in Figure 2 by $\|\tilde{U}\|$, where $\tilde{U} = A_\otimes^{1/2} U$ and we define $U = (u_1^\top, u_2^\top, \dots, u_n^\top)^\top \in \mathbb{R}^{dn}$. Clearly enough, the potentially unknown quantities $\|u\|$ and $\|\tilde{U}\|$ in the two expressions for $CB_t$ can be replaced by suitable upper bounds.\n\nWe now explain how the modified long vectors $\tilde{\phi}_{t,k} = A_\otimes^{-1/2} \phi_{i_t}(x_{t,k})$ act in the update of matrix $M_t$ and vector $b_t$. First, observe that if $A_\otimes$ were the identity matrix then, according to how the long vectors $\phi_{i_t}(x_{t,k})$ are defined, $M_t$ would be a block-diagonal matrix $M_t = \mathrm{diag}(D_1, \dots, D_n)$, whose i-th block $D_i$ is the $d \times d$ matrix $D_i = I_d + \sum_{t\,:\,i_t = i} \bar{x}_t \bar{x}_t^\top$. Similarly, $b_t$ would be the dn-long vector whose i-th d-dimensional block contains $\sum_{t\,:\,i_t = i} a_t \bar{x}_t$. This would be equivalent to running n independent linear bandit algorithms (Figure 1), one per node of G. 
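The notation above can be made concrete with a toy example (our code, with made-up sizes and a made-up path graph; `phi` is a hypothetical helper): $A_\otimes$ is `np.kron(A, I_d)`, $\phi_i(x)$ places x in the i-th block, and multiplying by $A_\otimes^{-1/2}$ spreads node i's context into the other nodes' blocks:

```python
# Toy illustration of A_kron = A ⊗ I_d, the compound descriptor phi_i(x),
# and the modified vector A_kron^{-1/2} phi_i(x).
import numpy as np

n, d = 3, 2
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])              # Laplacian of the path 0-1-2
A = np.eye(n) + L
evals, evecs = np.linalg.eigh(A)             # A is symmetric positive definite
A_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
A_kron_inv_sqrt = np.kron(A_inv_sqrt, np.eye(d))

def phi(i, x):
    """Compound descriptor: x in the i-th d-dim block, zeros elsewhere."""
    v = np.zeros(n * d)
    v[i * d:(i + 1) * d] = x
    return v

x = np.array([1.0, 2.0])
phi_tilde = A_kron_inv_sqrt @ phi(0, x)
# If A_kron were the identity, the blocks of phi_tilde for nodes 1 and 2
# would be zero; with the Laplacian in place they are not:
print(np.linalg.norm(phi_tilde[d:2 * d]) > 0)   # -> True
```

The eigendecomposition route to $A^{-1/2}$ works because A is symmetric positive definite; any matrix square root would do.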
Now, because $A_\otimes$ is not the identity, but contains graph G represented through its Laplacian matrix, the selected vector $x_{t,k_t} \in C_{i_t}$ for node $i_t$ gets spread via $A_\otimes^{-1/2}$ from the $i_t$-th block over all other blocks, thereby making the contextual information contained in $x_{t,k_t}$ available to update the internal status of all other nodes. Yet, the only reward signal observed at time t is the one available at node $i_t$. A theoretical analysis of GOB.Lin relying on the learning model of Section 3 is sketched in Section 4.1.\n\nGOB.Lin\u2019s running time is mainly affected by the inversion of the $dn \times dn$ matrix $M_t$, which can be performed in time of order $(dn)^2$ per round by using well-known formulas for incremental matrix inversions. The same quadratic dependence holds for memory requirements. In our experiments, we observed that projecting the contexts on the principal components improved performance. Hence, the quadratic dependence on the context vector dimension d is not really hurting us in practice. On the other hand, the quadratic dependence on the number of nodes n may be a significant limitation to GOB.Lin\u2019s practical deployment. In the next section, we show that simple graph compression schemes (like node clustering) can conveniently be applied both to reduce edge noise and to bring the algorithm to reasonable scaling behaviors.\n\n2 The Kronecker product between two matrices $M \in \mathbb{R}^{m \times n}$ and $N \in \mathbb{R}^{q \times r}$ is the block matrix $M \otimes N$ of dimension $mq \times nr$ whose block on row i and column j is the $q \times r$ matrix $M_{i,j} N$.\n\n4.1 Regret Analysis\n\nWe now provide a regret analysis for GOB.Lin that relies on the high-probability analysis contained in [1] (Theorem 2 therein). The analysis can be seen as a combination of the multitask kernel contained in, e.g., [8, 17, 12] and a version of the linear bandit algorithm described and analyzed in [1].\n\nTheorem 1. Let the GOB.Lin algorithm of Figure 2 be run on graph G = (V, E), V = {1, . . . , n}, hosting at each node $i \in V$ vector $u_i \in \mathbb{R}^d$. Define\n\n$$L(u_1, \dots, u_n) = \sum_{i \in V} \|u_i\|^2 + \sum_{(i,j) \in E} \|u_i - u_j\|^2\,.$$\n\nLet also the sequence of context vectors $x_{t,k}$ be such that $\|x_{t,k}\| \le B$, for all $k = 1, \dots, c_t$, and $t = 1, \dots, T$. Then the cumulative regret satisfies\n\n$$\sum_{t=1}^T r_t \le 2 \sqrt{T \left( 2\sigma^2 \ln \frac{|M_T|}{\delta} + 2 L(u_1, \dots, u_n) \right) (1 + B^2) \ln |M_T|}$$\n\nwith probability at least $1 - \delta$.\n\nCompared to running n independent bandit algorithms (which corresponds to $A_\otimes$ being the identity matrix), the bound in the above theorem has an extra term $\sum_{(i,j) \in E} \|u_i - u_j\|^2$, which we assume small according to our working assumption. However, the bound also has a significantly smaller log determinant $\ln |M_T|$ on the resulting matrix $M_T$, due to the construction of $\tilde{\phi}_{t,k}$ via $A_\otimes^{-1/2}$. In particular, when the graph is very dense, the log determinant in GOB.Lin is a factor n smaller than the corresponding term for the n independent bandit case (see, e.g., [8], Section 4.2 therein). To make things clear, consider two extreme situations. When G has no edges, then $\mathrm{TR}(M_T) = \mathrm{TR}(I) + T = nd + T$, hence $\ln |M_T| \le dn \ln(1 + T/(dn))$. 
On the other hand, when G is the complete graph, then $\mathrm{TR}(M_T) = \mathrm{TR}(I) + 2T/(n+1) = nd + 2T/(n+1)$, hence $\ln |M_T| \le dn \ln(1 + 2T/(dn(n+1)))$. The exact behavior of $\ln |M_t|$ (one that would ensure a significant advantage in practice) depends on the actual interplay between the data and the graph, so that the above linear dependence on dn is really a coarse upper bound.\n\n5 Experiments\n\nIn this section, we present an empirical comparison of GOB.Lin (and its variants) to linear bandit algorithms which do not exploit the relational information provided by the graph. We run our experiments by approximating the $CB_t$ function in Figure 1 with the simplified expression $\alpha \sqrt{x_{t,k}^\top M_{t-1}^{-1} x_{t,k} \log(t+1)}$, and the $CB_t$ function in Figure 2 with the corresponding expression in which $x_{t,k}$ is replaced by $\tilde{\phi}_{t,k}$. In both cases, the factor $\alpha$ is used as a tunable parameter. Our preliminary experiments show that this approximation does not affect the predictive performance of the algorithms, while it speeds up computation significantly. We tested our algorithm and its competitors on a synthetic dataset and two freely available real-world datasets extracted from the social bookmarking web service Delicious and from the music streaming service Last.fm. These datasets are structured as follows.\n\n4Cliques. This is an artificial dataset whose graph contains four cliques of 25 nodes each, to which we added graph noise. This noise consists of picking a random pair of nodes and deleting or creating an edge between them. More precisely, we created an $n \times n$ symmetric noise matrix of random numbers in [0, 1], and we selected a threshold value such that the expected number of matrix elements above this value is exactly some chosen noise rate parameter. Then we set to 1 all the entries whose content is above the threshold, and to zero the remaining ones. 
Finally, we XORed the noise matrix with the graph adjacency matrix, thus obtaining a noisy version of the original graph.\n\nLast.fm. This is a social network containing 1,892 nodes and 12,717 edges. There are 17,632 items (artists), described by 11,946 tags. The dataset contains information about the listened artists, and we used this information in order to create the payoffs: if a user listened to an artist at least once the payoff is 1, otherwise the payoff is 0.\n\nDelicious. This is a network with 1,861 nodes and 7,668 edges. There are 69,226 items (URLs) described by 53,388 tags. The payoffs were created using the information about the bookmarked URLs for each user: the payoff is 1 if the user bookmarked the URL, otherwise the payoff is 0.\n\nLast.fm and Delicious were created by the Information Retrieval group at Universidad Autonoma de Madrid for the HetRec 2011 Workshop [6] with the goal of investigating the usage of heterogeneous information in recommendation systems.3 These two networks are structurally different: on Delicious, payoffs depend on users more strongly than on Last.fm. In other words, there are more popular artists, whom everybody listens to, than popular websites, which everybody bookmarks (see Figure 3). This makes a huge difference in practice, and the choice of these two datasets allows us to make a more realistic comparison of recommendation techniques. Since we did not remove any items from these datasets (neither the most frequent nor the least frequent), these differences do influence the behavior of all algorithms (see below).\n\nSome statistics about Last.fm and Delicious are reported in Table 1. In Figure 3 we plotted the distribution of the number of preferences per item in order to make the differences explained in the previous paragraphs clearly visible.4\n\n                   LAST.FM   DELICIOUS\nNODES              1892      1867\nEDGES              12717     7668\nAVG. DEGREE        13.443    8.21\nITEMS              17632     69226\nNONZERO PAYOFFS    92834     104799\nTAGS               11946     53388\n\nTable 1: Main statistics for Last.fm and Delicious. ITEMS counts the overall number of items, across all users, from which Ct is selected. NONZERO PAYOFFS is the number of pairs (user, item) for which we have a nonzero payoff. TAGS is the number of distinct tags that were used to describe the items.\n\nFigure 3: Plot of the number of preferences per item (users who bookmarked the URL or listened to an artist). Both axes have logarithmic scale.\n\nWe preprocessed datasets by breaking down the tags into smaller tags made up of single words. In fact, many users tend to create tags like \u201cwebdesign tutorial css\u201d. This tag has been split into three smaller tags corresponding to the three words therein. More generally, we split all compound tags containing underscores, hyphens and apexes. This makes sense because users create tags independently, and we may have both \u201crock and roll\u201d and \u201crock n roll\u201d. Because of this splitting operation, the number of unique tags decreased from 11,946 to 6,036 on Last.fm and from 53,388 to 9,949 on Delicious. On Delicious, we also removed all tags occurring less than ten times.5 The\n\n3 Datasets and their full descriptions are available at www.grouplens.org/node/462.\n4 In the context of recommender systems, these two datasets may be seen as representatives of two \u201cmarkets\u201d whose products have significantly different market shares (the well-known dichotomy of hit vs. 
niche products). Niche product markets give rise to power laws in user preference statistics (as in the blue plot of Figure 3).\n5 We did not repeat the same operation on Last.fm because this dataset was already extremely sparse.\n\nTable 2: Normalized cumulated reward for different levels of graph noise (expected fraction of perturbed edges) and payoff noise (largest absolute value of noise term $\epsilon$) on the 4Cliques dataset. Graph noise increases from top to bottom, payoff noise increases from left to right. GOB.Lin is clearly more robust to payoff noise than its competitors. On the other hand, GOB.Lin is sensitive to high levels of graph noise. In the last row, graph noise is 41.7%, i.e., the number of perturbed edges is 500 out of the 1200 edges of the original graph.\n\nalgorithms we tested do not use any prior information about which user provided a specific tag. We used all tags associated with a single item to create a TF-IDF context vector that uniquely represents that item, independent of which user the item is proposed to. In both datasets, we only retained the first 25 principal components of the context vectors, so that $x_{t,k} \in \mathbb{R}^{25}$ for all t and k. We generated random context sets $C_{i_t}$ of size 25 for Last.fm and Delicious, and of size 10 for 4Cliques. In practical scenarios, these numbers would be varying over time, but we kept them fixed so as to simplify the experimental setting. In 4Cliques we assigned the same unit-norm random vector $u_i$ to every node in the same clique i of the original graph (before adding graph noise). Payoffs were then generated according to the following stochastic model: $a_i(x) = u_i^\top x + \epsilon$, where $\epsilon$ (the payoff noise) is uniformly distributed in a bounded interval centered around zero. 
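The 4Cliques payoff model just described is easy to simulate. The sketch below is ours, with an illustrative noise level and the clique layout stated above (four cliques of 25 nodes): every node of clique i shares the same unit-norm vector $u_i$, and payoffs add bounded uniform noise:

```python
# Simulate the 4Cliques payoff model: a_i(x) = u_i' x + eps, where u_i is
# shared by all nodes of clique i and eps is uniform in [-noise, noise].
# The noise level 0.25 is illustrative (the paper sweeps several levels).
import numpy as np

rng = np.random.default_rng(42)
d, cliques, per_clique, noise = 10, 4, 25, 0.25
U = rng.normal(size=(cliques, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)    # one unit vector per clique
node_clique = np.repeat(np.arange(cliques), per_clique)   # 100 nodes total

def payoff(node, x):
    eps = rng.uniform(-noise, noise)             # bounded, zero-mean noise
    return U[node_clique[node]] @ x + eps

x = rng.normal(size=d)
# Nodes 0 and 1 are in the same clique, so they share u and their payoffs
# differ only by the noise draws:
print(abs(payoff(0, x) - payoff(1, x)) <= 2 * noise)   # -> True
```

Graph noise would then be layered on top by flipping edges, as in the thresholded-matrix XOR procedure described earlier.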
For Delicious and Last.fm, we created a set of context vectors for every round t as follows: we first picked $i_t$ uniformly at random in {1, . . . , n}. Then, we generated context vectors $x_{t,1}, \dots, x_{t,25}$ in $C_{i_t}$ by picking 24 vectors at random from the dataset and one among those vectors with nonzero payoff for user $i_t$. This is necessary in order to avoid a meaningless comparison: with high probability, a purely random selection would result in payoffs equal to zero for all the context vectors in $C_{i_t}$. In our experimental comparison, we tested GOB.Lin and its variants against two baselines: a baseline LinUCB-IND that runs an independent instance of the algorithm in Figure 1 at each node (this is equivalent to running GOB.Lin in Figure 2 with $A_\otimes = I_{dn}$), and a baseline LinUCB-SIN, which runs a single instance of the algorithm in Figure 1 shared by all the nodes. LinUCB-IND turns out to be
6000 8000 10000CUMULATIVE REWARDTIME4Cliques graph-noise=20.8% payoff-noise=0.25GOB.LinLinUCB-INDLinUCB-SIN 0 500 1000 1500 2000 2500 3000 0 2000 4000 6000 8000 10000CUMULATIVE REWARDTIME4Cliques graph-noise=20.8% payoff-noise=0.5GOB.LinLinUCB-INDLinUCB-SIN 0 500 1000 1500 2000 2500 3000 0 2000 4000 6000 8000 10000CUMULATIVE REWARDTIME4Cliques graph-noise=41.7% payoff-noise=0GOB.LinLinUCB-INDLinUCB-SIN 0 500 1000 1500 2000 2500 3000 0 2000 4000 6000 8000 10000CUMULATIVE REWARDTIME4Cliques graph-noise=41.7% payoff-noise=0.25GOB.LinLinUCB-INDLinUCB-SIN 0 500 1000 1500 2000 2500 3000 0 2000 4000 6000 8000 10000CUMULATIVE REWARDTIME4Cliques graph-noise=41.7% payoff-noise=0.5GOB.LinLinUCB-INDLinUCB-SIN\fFigure 4: Cumulative reward for all the bandit algorithms introduced in this section.\n\na reasonable comparator when, as in the Delicious dataset, there are many moderately popular items.\nOn the other hand, LinUCB-SIN is a competitive baseline when, as in the Last.fm dataset, there are\nfew very popular items. The two scalable variants of GOB.Lin which we empirically analyzed are\nbased on node clustering,6 and are de\ufb01ned as follows.\nGOB.Lin.MACRO: GOB.Lin is run on a weighted graph whose nodes are the clusters of the origi-\nnal graph. The edges are weighted by the number of inter-cluster edges in the original graph. When\nall nodes are clustered together, then GOB.Lin.MACRO recovers the baseline LinUCB-SIN as a\nspecial case. In order to strike a good trade-off between the speed of the algorithms and the loss of\ninformation resulting from clustering, we tested three different cluster sizes: 50, 100, and 200. Our\nplots refer to the best performing choice.\nGOB.Lin.BLOCK: GOB.Lin is run on a disconnected graph whose connected components are the\nclusters. This makes A\u2297 and Mt (Figure 2) block-diagonal matrices. When each node is clustered\nindividually, then GOB.Lin.BLOCK recovers the baseline LinUCB-IND as a special case. 
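The two clustered-graph constructions above can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the original social graph is given as a dense adjacency matrix `A` and the clustering as a node-to-cluster assignment array `cluster`; function names are ours.

```python
import numpy as np

def macro_graph(A, cluster):
    """Weighted macro graph for GOB.Lin.MACRO: one node per cluster; the
    weight of edge (a, b) is the number of inter-cluster edges between
    clusters a and b in the original graph."""
    n = A.shape[0]
    k = cluster.max() + 1
    # Cluster-membership indicator matrix: C[i, a] = 1 iff node i is in cluster a.
    C = np.zeros((n, k))
    C[np.arange(n), cluster] = 1.0
    W = C.T @ A @ C          # W[a, b] counts edges between clusters a and b
    np.fill_diagonal(W, 0)   # discard intra-cluster edges (would be self-loops)
    return W

def block_graph(A, cluster):
    """Disconnected graph for GOB.Lin.BLOCK: keep only intra-cluster edges,
    so the connected components are the clusters and the resulting
    Laplacian-based matrices become block-diagonal."""
    same_cluster = cluster[:, None] == cluster[None, :]
    return A * same_cluster
```

For example, clustering a 4-node path graph 0-1-2-3 into clusters {0, 1} and {2, 3} yields a macro graph with a single edge of weight 1 (the edge 1-2) and a block graph containing only the two intra-cluster edges.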
Similarly to GOB.Lin.MACRO, in order to trade off running time against cluster size, we tested three different cluster sizes (5, 10, and 20), and report only on the best performing choice.

As the running time of GOB.Lin scales quadratically with the number of nodes, the computational savings provided by the clustering are also quadratic. Moreover, as we will see in the experiments, the clustering acts as a regularizer, limiting the influence of noise. In all cases, the parameter α in Figures 1 and 2 was selected based on the scale of the instance vectors $\bar{x}_t$ and $\tilde{\phi}_{t,k_t}$, respectively, and tuned across appropriate ranges. Table 2 and Figure 4 show the cumulative reward for each algorithm, as compared ("normalized") to that of the random predictor, that is $\sum_t (a_t - \bar{a}_t)$, where $a_t$ is the payoff obtained by the algorithm and $\bar{a}_t$ is the payoff obtained by the random predictor, i.e., the average payoff over the context vectors available at time t. Table 2 (synthetic datasets) shows that GOB.Lin and LinUCB-SIN are more robust to payoff noise than LinUCB-IND. Clearly, LinUCB-SIN is also unaffected by graph noise, but it never outperforms GOB.Lin. When the payoff noise is low and the graph noise grows, GOB.Lin's performance tends to degrade. Figure 4 reports the results on the two real-world datasets. Notice that GOB.Lin and its variants always outperform the baselines (which do not rely on graph information) on both datasets. As expected, GOB.Lin.MACRO works best on Last.fm, where many users gave positive payoffs to the same few items. Hence, macro nodes apparently help GOB.Lin.MACRO to perform better than its corresponding baseline LinUCB-SIN. In fact, GOB.Lin.MACRO also outperforms GOB.Lin, thus showing the regularization effect of using macro nodes. On Delicious, where we have many moderately popular items, GOB.Lin.BLOCK tends to perform best, GOB.Lin being the runner-up.
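The normalized cumulative reward used above can be sketched as a short computation. This is an assumed setup for illustration, not the authors' evaluation code: `algo_payoffs[t]` is the payoff $a_t$ the algorithm obtained at round t, and `payoff_matrix[t]` lists the payoffs of all context vectors available at that round, so their mean is the random predictor's expected payoff $\bar{a}_t$.

```python
def normalized_cumulative_reward(algo_payoffs, payoff_matrix):
    """Cumulative reward normalized by the random predictor:
    sum over rounds t of (a_t - abar_t)."""
    total = 0.0
    for a_t, payoffs in zip(algo_payoffs, payoff_matrix):
        # Random predictor: average payoff over the available context vectors.
        abar_t = sum(payoffs) / len(payoffs)
        total += a_t - abar_t
    return total
```

For instance, an algorithm that picks the single rewarding item among 4 candidates at round 1 and gets zero at a payoff-free round 2 scores (1 - 0.25) + (0 - 0) = 0.75.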
As expected, LinUCB-IND works better than LinUCB-SIN, since the former is clearly better suited than the latter to personalizing item recommendations. Future work will consider experiments against different methods for sharing contextual and feedback information among a set of users, such as the feature hashing technique of [22].

Acknowledgments
NCB and GZ gratefully acknowledge partial support by MIUR (project ARS TechnoMedia, PRIN 2010-2011, contract no. 2010N5K7EB-003). We thank the Laboratory for Web Algorithmics at the Dept. of Computer Science of the University of Milan.

6 We used the freely available Graclus (see, e.g., [11]) graph clustering tool with normalized cut, zero local search steps, and no spectral clustering options.

[Figure 4 panels: cumulative reward vs. time on Last.fm and Delicious; curves: LinUCB-SIN, LinUCB-IND, GOB.Lin, GOB.Lin.MACRO, GOB.Lin.BLOCK.]

References
[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 2011.
[2] K. Amin, M. Kearns, and U. Syed. Graphical models for bandit problems. Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, 2011.
[3] D. Asanov. Algorithms and methods in recommender systems. Berlin Institute of Technology, Berlin, Germany, 2011.
[4] P. Auer. Using confidence bounds for exploration-exploitation trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.
[5] T. Bogers. Movie recommendation using random walks over the contextual graph. In CARS'10: Proceedings of the 2nd Workshop on Context-Aware Recommender Systems, 2010.
[6] I. Cantador, P. Brusilovsky, and T. Kuflik. 2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011).
In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011. ACM, 2011.
[7] S. Caron, B. Kveton, M. Lelarge, and S. Bhagat. Leveraging side observations in stochastic bandits. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 142–151, 2012.
[8] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2597–2630, 2010.
[9] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
[10] K. Crammer and C. Gentile. Multiclass classification with bandit feedback using adaptive regularization. Machine Learning, 90(3):347–383, 2013.
[11] I. S. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(11):1944–1957, 2007.
[12] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, pages 109–117, New York, NY, USA, 2004. ACM.
[13] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, and S. Ofek-Koifman. Personalized recommendation of social software items based on social relations. In Proceedings of the Third ACM Conference on Recommender Systems, pages 53–60. ACM, 2009.
[14] S. Kar, H. V. Poor, and S. Cui. Bandit problems in networks: Asymptotically efficient distributed allocation rules. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 1771–1778. IEEE, 2011.
[15] L. Li, W. Chu, J. Langford, and R. E. Schapire.
A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[16] S. Mannor and O. Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.
[17] C. A. Micchelli and M. Pontil. Kernels for multi-task learning. In Advances in Neural Information Processing Systems, pages 921–928, 2004.
[18] A. Said, E. W. De Luca, and S. Albayrak. How social relationships affect user similarities. In Proceedings of the 2010 Workshop on Social Recommender Systems, pages 1–4, 2010.
[19] A. Slivkins. Contextual bandits with similarity information. Journal of Machine Learning Research – Proceedings Track, 19:679–702, 2011.
[20] B. Swapna, A. Eryilmaz, and N. B. Shroff. Multi-armed bandits in the presence of side observations in social networks. In Proceedings of the 52nd IEEE Conference on Decision and Control (CDC), 2013.
[21] B. Szörényi, R. Busa-Fekete, I. Hegedus, R. Ormándi, M. Jelasity, and B. Kégl. Gossip-based distributed stochastic bandit algorithms. Proceedings of the 30th International Conference on Machine Learning, 2013.
[22] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th International Conference on Machine Learning, pages 1113–1120. Omnipress, 2009.