{"title": "Linear Submodular Bandits and their Application to Diversified Retrieval", "book": "Advances in Neural Information Processing Systems", "page_first": 2483, "page_last": 2491, "abstract": "Diversified retrieval and online learning are two core research areas in the design of modern information retrieval systems.In this paper, we propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of feature-rich submodular utility models for diversified retrieval. We present an algorithm, called LSBGREEDY, and prove that it efficiently converges to a near-optimal model. As a case study, we applied our approach to the setting of personalized news recommendation, where the system must recommend small sets of news articles selected from tens of thousands of available articles each day. In a live user study, we found that LSBGREEDY significantly outperforms existing online learning approaches.", "full_text": "Linear Submodular Bandits\n\nand their Application to Diversi\ufb01ed Retrieval\n\nYisong Yue\n\niLab, Heinz College\n\nCarnegie Mellon University\nyisongyue@cmu.edu\n\nCarlos Guestrin\n\nMachine Learning Department\nCarnegie Mellon University\nguestrin@cs.cmu.edu\n\nAbstract\n\nDiversi\ufb01ed retrieval and online learning are two core research areas in the design\nof modern information retrieval systems. In this paper, we propose the linear sub-\nmodular bandits problem, which is an online learning setting for optimizing a gen-\neral class of feature-rich submodular utility models for diversi\ufb01ed retrieval. We\npresent an algorithm, called LSBGREEDY, and prove that it ef\ufb01ciently converges\nto a near-optimal model. 
As a case study, we applied our approach to the setting of personalized news recommendation, where the system must recommend small sets of news articles selected from tens of thousands of available articles each day. In a live user study, we found that LSBGreedy significantly outperforms existing online learning approaches.

1 Introduction

User feedback has become an invaluable source of training data for optimizing information retrieval systems in a rapidly expanding range of domains, most notably content recommendation (e.g., news, movies, ads). When designing retrieval systems that adapt to user feedback, two important challenges arise. First, the system should recommend optimally diversified content that maximizes coverage of the information the user finds interesting (to maximize positive feedback). Second, the system should make exploratory recommendations in order to learn a reliable model from feedback.

Challenge 1: diversification. In most retrieval settings, the retrieval system must recommend sets of articles, rather than individual articles. Furthermore, the recommended articles should be well diversified. This is motivated by the principle that recommending redundant articles leads to diminishing returns on utility, since users need to consume redundant information only once. This notion of diminishing returns is well captured by submodular utility models, which have become an increasingly popular approach to modeling diversified retrieval tasks in recent years [24, 25, 18, 3, 21, 9, 16].

Challenge 2: feature-based exploration. In most retrieval settings, users typically only provide feedback on the articles recommended to them. This partial feedback issue leads to an inherent tension between exploration and exploitation when deciding which articles to recommend to the user.
Furthermore, it is typically desirable to learn a feature-based model that can generalize to new or previously unseen articles and users; this is often called the contextual bandits problem [13, 15, 7].

Although there exist approaches that address these challenges individually, to our knowledge no single approach solves both simultaneously while remaining practical to implement. For instance, existing online approaches for optimizing submodular functions typically assume a feature-free model, and thus cannot generalize easily [18, 22, 23]. Such approaches measure performance relative to the single best set (e.g., of articles). Thus, they are not suitable for many retrieval settings, since the set of available articles can change frequently (e.g., news recommendation).

In this paper, we address both challenges in a unified framework. We propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of feature-based submodular utility models. To make learning practical, we represent the benefit of adding an article to an existing set of selected articles as a linear model with respect to the user's preferences. This class of models encompasses several existing information coverage utility models for diversified retrieval [24, 25, 9], and allows us to learn flexible models that can generalize to new predictions.

Similar to the contextual bandits setting considered in [15], our setting can be characterized as a feature-based exploration-exploitation problem, where the uncertainty lies in how best to model user interests using the available features. In contrast to [15], we aim to recommend optimally diversified sets of articles rather than just single articles. From that standpoint, modeling this additional layer of complexity in the bandit setting is our main technical contribution.
We present an algorithm, called LSBGreedy, to optimize this exploration-exploitation trade-off. When learning a d-dimensional model to recommend sets of L articles for T time steps, we prove that LSBGreedy incurs regret that grows as O(d√(LT)) (ignoring log factors). This regret matches the convergence rates of analogous algorithms for the conventional linear bandits setting [1, 20, 8].

As a case study, we applied our approach to the setting of personalized news recommendation [9, 15, 16]. In addition to simulation experiments, we conducted a live user study over a period of ten rounds, where in each round the retrieval system must recommend a small set of news articles selected from tens of thousands of available articles for that round. We compared against existing online learning approaches that either employ no exploration [9], or learn to recommend only single articles (and thus do not model diversity) [15]. Compared to previous approaches, we find that LSBGreedy can significantly improve the performance of the retrieval system even when learning for a limited number of rounds. Our empirical results demonstrate the advantage of jointly tackling the challenges of diversification and feature-based exploration, as well as showcase the practicality of our approach.

2 Submodular Information Coverage Models

Before presenting our online learning setting, we first describe the class of utility functions that we optimize over. Throughout this paper, we use personalized news recommendation as our motivating example.
In this setting, utility corresponds to the amount of interesting information covered by the set of recommended articles.

Suppose that news articles are represented using a set of d "topics" or "concepts" that we wish to cover (e.g., the Middle East or the weather). Intuitively, recommending two articles that cover highly overlapping topics might not be more beneficial than recommending just one of the articles – this is the notion of diminishing returns we wish to capture in our information coverage model.

Two key properties we will exploit are that our utility functions are monotone and submodular. A set function F mapping sets of recommended articles A to real values (e.g., the total information covered by A) is monotone and submodular if and only if

    F(A ∪ {a}) ≥ F(A)   and   F(A ∪ {a}) − F(A) ≥ F(B ∪ {a}) − F(B),

respectively, for all articles a and sets A ⊆ B. In other words, since A is smaller than B, the benefit of adding a to A is larger than the benefit of adding a to B. Submodularity provides a natural framework for characterizing diminishing returns in information coverage, since the gain of adding a second (redundant) article on a topic will be smaller than the gain of adding the first.

For each topic i, let F_i(A) be a monotone submodular function corresponding to how well the recommended articles A cover topic i. We write the total utility of recommending A as

    F(A | w) = w^T ⟨F_1(A), ..., F_d(A)⟩,    (1)

where w ∈ ℝ^d_+ is a parameter vector indicating the user's interest level in each topic. Thus, F(A | w) is itself monotone submodular. One natural instantiation is probabilistic coverage, where each article a covers topic i with probability P(i | a), and

    F_i(A) = 1 − ∏_{a ∈ A} (1 − P(i | a)).    (2)

The expected incremental gain of adding article a to a set A is then w^T Δ(a | A), where

    Δ(a | A) ≡ ⟨F_1(A ∪ {a}) − F_1(A), ..., F_d(A ∪ {a}) − F_d(A)⟩.    (3)

In other words, the i-th component of Δ(a | A) corresponds to the incremental coverage (i.e., submodular advantage) of topic i by article a, conditioned on articles A having already been selected. This property will be exploited by our online learning algorithm presented in Section 4.

Optimization.
Another attractive property of monotone submodular functions is that the myopic greedy algorithm is guaranteed to produce a near-optimal solution [17]. For any budget L (e.g., L = 10 articles), the constrained optimization problem, argmax_{A : |A| ≤ L} F(A | w), can be solved greedily to produce a solution that is within a factor (1 − 1/e) ≈ 0.63 of optimal. Achieving better than (1 − 1/e)OPT is known to be intractable unless P = NP [10]. In practice, the greedy algorithm can often perform much better than this worst-case guarantee (cf. [14]), and it will be a central component in our online learning algorithm.

3 Problem Formulation

We propose the linear submodular bandits problem, which is described as follows. At each time step t = 1, ..., T, our algorithm interacts with the user in the following way:

• A set of articles 𝒜_t is made available to the algorithm. Each article a ∈ 𝒜_t is represented using a set of d basis coverage functions F_1, ..., F_d, defined as in Section 2, which is known to the algorithm.
• The algorithm chooses a ranked set of L articles, denoted A_t = (a_t^(1), ..., a_t^(L)), using the basis coverage functions of the articles and the outcomes of previous time steps.
• The user provides feedback (e.g., clicks on or ignores each article), and the rewards r_t(A_t) for the recommended articles, defined in (4), are observed.

In order to develop our algorithm, we require a model of user behavior. We assume the user scans the recommended articles A = (a^(1), ..., a^(L)) one by one in top-down fashion. For each article a^(ℓ), the user considers the new information covered by a^(ℓ) and not covered by the articles above it, A^(1:ℓ−1) (where A^(1:ℓ) denotes the articles in the first ℓ slots). In our representation, this new information is Δ(a^(ℓ) | A^(1:ℓ−1)) as in (3).
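To make these definitions concrete, the following sketch (Python/NumPy) implements probabilistic coverage (2), incremental coverage (3), and the myopic greedy selection of L articles; the topic-coverage matrix P, preference vector w, and all numeric values are illustrative placeholders, not taken from the paper:

```python
import numpy as np

def coverage(P, A):
    """Probabilistic coverage (2): F_i(A) = 1 - prod_{a in A} (1 - P(i|a)).
    P is an (n_articles x d) matrix of topic-coverage probabilities P(i|a);
    A is a list of article indices."""
    if len(A) == 0:
        return np.zeros(P.shape[1])
    return 1.0 - np.prod(1.0 - P[A], axis=0)

def incremental_coverage(P, a, A):
    """Incremental coverage (3): Delta(a|A)_i = F_i(A + {a}) - F_i(A)."""
    return coverage(P, A + [a]) - coverage(P, A)

def greedy_select(P, w, L):
    """Myopic greedy maximization of F(A|w) = w^T <F_1(A),...,F_d(A)>;
    near-optimal (factor 1 - 1/e) since F(.|w) is monotone submodular for w >= 0."""
    A = []
    for _ in range(L):
        candidates = [a for a in range(P.shape[0]) if a not in A]
        gains = [w @ incremental_coverage(P, a, A) for a in candidates]
        A.append(candidates[int(np.argmax(gains))])
    return A
```

For instance, with two near-duplicate articles on the same topic, the second yields a strictly smaller gain than the first, which is exactly the diminishing-returns behavior the model is meant to capture.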
The user then clicks on (or likes) a^(ℓ) with independent probability (w*)^T Δ(a^(ℓ) | A^(1:ℓ−1)), where w* encodes the hidden preferences of the user. Formally, for any set of articles A chosen at time t, the rewards r_t(A) can be written as the sum of rewards at each slot,

    r_t(A) = Σ_{ℓ=1}^{L} r_t^(ℓ)(A).    (4)

² E.g., the topics and coverage probabilities can be derived from a topic model such as LDA [4].

Algorithm 1 LSBGreedy
 1: input: λ, α_t
 2: for t = 1, ..., T do
 3:   M_t ← λ I_d + Σ_{τ=1}^{t−1} Σ_{ℓ=1}^{L} Δ_τ^(ℓ) (Δ_τ^(ℓ))^T    // covariance matrix
 4:   b_t ← Σ_{τ=1}^{t−1} Σ_{ℓ=1}^{L} r̂_τ^(ℓ) Δ_τ^(ℓ)    // aggregate feedback so far
 5:   w_t ← M_t^{−1} b_t    // linear regression using previous feedback as training data
 6:   A_t ← ∅
 7:   for ℓ = 1, ..., L do
 8:     ∀a ∈ 𝒜_t \ A_t : μ_a ← w_t^T Δ(a | A_t)    // compute mean estimate of utility gain
 9:     ∀a ∈ 𝒜_t \ A_t : c_a ← α_t √(Δ(a | A_t)^T M_t^{−1} Δ(a | A_t))    // compute confidence interval
10:     set a_t^(ℓ) ← argmax_a (μ_a + c_a)    // select article with highest upper confidence bound
11:     store Δ_t^(ℓ) ← Δ(a_t^(ℓ) | A_t), and A_t ← A_t ∪ {a_t^(ℓ)}
12:   end for
13:   recommend articles A_t in the order selected, and observe rewards r̂_t^(1), ..., r̂_t^(L) for each slot
14: end for

We assume each r_t^(ℓ) is an independent random variable bounded in [0, 1] that satisfies

    E[r_t^(ℓ)(A)] = (w*)^T Δ(a^(ℓ) | A^(1:ℓ−1)),    (5)

where w* is a weight vector unknown to the algorithm with ‖w*‖ ≤ S. In other words, the expected reward in each slot is realizable, linear in Δ(a^(ℓ) | A^(1:ℓ−1)), and independent of the other slots. We call this independence property conditional submodular independence, which we will leverage in
While conditional submodular independence may seem ideal, we will show in our user\nstudy experiments that it is not required for our proposed algorithm to achieve good performance.\nEquations (4) and (5) imply that E[rt(A)] = F (A|w\u21e4) for F de\ufb01ned as in (1). Thus, E[rt] is\nmonotone submodular, and a clairvoyant system with perfect knowledge of w\u21e4 can greedily select\narticles to achieve (expected) reward at least (1 1/e)OP T , where OP T denotes the total expected\nreward of the optimal recommendations for t = 1, . . . , T . Let A\u21e4t denote the optimal set of articles\nat time t. We quantify performance using the following notion of regret which we call greedy regret,\n\nRegG(T ) =\u27131 \n\n1\n\ne\u25c6 TXt=1\n\nE [rt(A\u21e4t )] \n\nrt(At) \u2318\u27131 \n\n1\n\ne\u25c6 OP T \n\nTXt=1\n\nTXt=1\n\nrt(At).\n\n(6)\n\n4 Algorithm and Main Results\n\nA central question in the study of bandit problems is how best to balance the trade-off between\nexploration and exploitation (cf. [15]). To minimize regret (6), an algorithm must exploit its past\nexperience to recommend sets of articles that appear to maximize information coverage. However,\ntopics that appear good (i.e., interesting to the user) may actually be suboptimal due to impreci-\nsion in the algorithm\u2019s knowledge. In order to avoid this situation, the algorithm must explore by\nrecommending articles about seemingly poor topics in order to gather more information about them.\nIn this section, we present an algorithm, called LSBGREEDY, which automatically trades off be-\ntween exploration and exploitation (Algorithm 1). LSBGREEDY balances exploration and exploita-\ntion using upper con\ufb01dence bounds on the estimated gain in utility, and builds upon upper con\ufb01dence\nbound style algorithms for the conventional linear bandits setting [8, 20, 15, 7, 1]. Intuitively, the\nalgorithm can be decomposed into the following components.\nTraining a Model. 
Since we employ a linear model, at each time t we can fit an estimate w_t of the true w* via linear regression on the previous feedback. Lines 3–5 in Algorithm 1 describe this step, where Δ_τ^(ℓ) denotes the incremental coverage features of the article selected at time τ and slot ℓ, and r̂_τ^(ℓ) denotes the associated reward. Note that λ in Line 3 is the standard regularization parameter.

Estimating Incremental Coverage. Given w_t, we can now estimate the incremental gain of adding any article a to an existing set of results A. As discussed in Section 3, the true (expected) incremental gain is (w*)^T Δ(a | A). Our algorithm's estimate is w_t^T Δ(a | A) (Line 8). If our algorithm were to purely exploit prior knowledge, then it would greedily choose articles that maximize w_t^T Δ(a | A).³

Computing Confidence Intervals. Of course, each w_t is an imprecise estimate of the true w*. Given such uncertainty, a natural approach is to use confidence intervals which contain the true w* with some target confidence (e.g., 95%).

³ Note that w_t may have negative components, which would make F(· | w_t) not monotone submodular. However, regret is measured by F(· | w*), which is monotone submodular. We show in our analysis that having negative components in w_t does not hinder our ability to converge efficiently to w* in a regret sense.

    t    A_t^(1)    r_t^(1)    A_t^(2)    r_t^(2)
    1    a1         1          a2         0
    2    b1         1          b3         1

Figure 1: Illustrative example of LSBGreedy for L = 2 and 2 days. Each day comprises 3 articles covering 4 topics, which are depicted in the two plots. Each row in the table describes the choices of LSBGreedy and the resulting feedback. In day 1, LSBGreedy recommends articles to explore topics 1, 2, and 3, and the user indicates liking a1 and disliking a2. In day 2, LSBGreedy recommends b1 to exploitatively cover topic 1, and b3 to both cover topic 1 and explore topic 4.
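The model-fitting step (Lines 3–5) and the per-article mean and confidence estimates (Lines 8–9) amount to ridge regression plus an ellipsoidal confidence width. A minimal sketch (Python/NumPy; Phi and r are illustrative stand-ins for the stored features Δ_τ^(ℓ) and rewards r̂_τ^(ℓ)):

```python
import numpy as np

def fit_model(Phi, r, lam=1.0):
    """Lines 3-5 of Algorithm 1: ridge-regression estimate of w*.
    Phi: (n x d) matrix whose rows are incremental-coverage features of past slots.
    r:   (n,) vector of observed per-slot rewards."""
    d = Phi.shape[1]
    M = lam * np.eye(d) + Phi.T @ Phi   # covariance matrix M_t
    b = Phi.T @ r                       # aggregated feedback b_t
    w = np.linalg.solve(M, b)           # w_t = M_t^{-1} b_t
    return w, M

def ucb_score(delta, w, M, alpha):
    """Lines 8-10: mean estimate w_t^T Delta(a|A) plus confidence width
    alpha * ||Delta(a|A)||_{M_t^{-1}}; the greedy step picks the argmax."""
    mu = float(w @ delta)
    c = alpha * float(np.sqrt(delta @ np.linalg.solve(M, delta)))
    return mu + c
```

Setting alpha = 0 recovers pure exploitation, while larger alpha favors articles whose topic directions have received little feedback.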
Our algorithm's uncertainty in the gain of article a given set A depends directly on how much feedback we have collected regarding the prominent topics in Δ(a | A). In our linear setting, uncertainty is measured using the inverse covariance matrix M_t^{−1} of the submodular features of the previously selected articles (Line 9). If our algorithm were to purely explore, then it would greedily select articles that have maximal uncertainty √(Δ(a | A)^T M_t^{−1} Δ(a | A)).

Balancing Exploration and Exploitation. In order to achieve low regret, LSBGreedy greedily selects articles that maximize a compromise between estimated gain and uncertainty (Line 10), with α_t controlling the trade-off. For any δ ∈ (0, 1), Lemma 3 in Appendix A.2 provides sufficient conditions on α_t for constructing confidence intervals,

    w_t^T Δ(a | A) ± α_t √(Δ(a | A)^T M_t^{−1} Δ(a | A)) ≡ w_t^T Δ(a | A) ± α_t ‖Δ(a | A)‖_{M_t^{−1}},    (7)

that contain the true value, (w*)^T Δ(a | A), with probability at least 1 − δ. In this sense, Line 10 maximizes the upper confidence bound on the true expected reward. Figure 1 provides an illustrative example of the behavior of LSBGreedy.

We now state our main result, which essentially bounds the greedy regret (6) of LSBGreedy as O(d√(TL)) (ignoring log factors). This means that the average loss incurred per slot and per day by LSBGreedy relative to (1 − 1/e)OPT decreases at a rate of O(d/√(TL)).

Theorem 1. For L ≤ d, λ = L, and α_t defined as

    α_t = √(2 log(2 det(M_t)^{1/2} det(λ I_d)^{−1/2} / δ)) + S√λ,    (8)

with probability at least 1 − δ, LSBGreedy achieves greedy regret (6) bounded by

    Reg_G(T) ≤ α_T √(8 T L log det(M_{T+1})) + √(2 (1 + TL) log(√(1 + TL) / (δ/2))) = O(S d √(TL) log(TL / δ)).

The proof of Theorem 1 is presented in Appendix A in the supplementary material. In practice, the choice of α_t in (8) may be overly conservative.
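The theoretical coefficient can be computed directly from the covariance matrix. A sketch (Python/NumPy), assuming the self-normalized confidence form α_t = √(2 log(2 det(M_t)^{1/2} det(λI_d)^{−1/2} / δ)) + S√λ, where S, λ, and δ are the norm bound, regularization parameter, and failure probability from Theorem 1:

```python
import numpy as np

def alpha_t(M, lam, delta, S):
    """Exploration coefficient of the form
    sqrt(2 * log(2 * det(M)^{1/2} * det(lam*I)^{-1/2} / delta)) + S * sqrt(lam),
    computed via slogdet for numerical stability."""
    d = M.shape[0]
    _, logdet_M = np.linalg.slogdet(M)
    # log[ det(M)^{1/2} * det(lam*I)^{-1/2} ]
    log_ratio = 0.5 * logdet_M - 0.5 * d * np.log(lam)
    return float(np.sqrt(2.0 * (np.log(2.0 / delta) + log_ratio)) + S * np.sqrt(lam))
```

Before any feedback is collected (M_1 = λI_d), this reduces to √(2 log(2/δ)) + S√λ, and it grows only logarithmically as the covariance matrix accumulates feedback.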
As we show in our experiments, more aggressive choices of α_t can often lead to faster convergence.

5 Empirical Analysis: Case Study in News Recommendation

We applied LSBGreedy to the setting of personalized news recommendation (cf. [9, 15, 16]), where the system is tasked with recommending sets of articles that maximally cover the interesting information of the available articles. The user provides feedback (e.g., by indicating that she likes or dislikes each article), and the goal is to maximize the total positive feedback by personalizing to the user. We conducted both simulation experiments as well as a live user study. Since real users are unlikely to behave exactly according to our modeling assumptions (e.g., obey conditional submodular independence), our user study tests the effectiveness of our approach in settings beyond those considered in our theoretical analysis.

5.1 Simulations

Data. We ran simulations using both synthetic datasets as well as the blog dataset from [9]. For each setting, we generated a hidden true preference vector w*. For the synthetic data, all articles were randomly generated using d = 25 topics, and w* was randomly generated and re-scaled so the most likely articles were liked with probability ≈ 75%. For the blog dataset, articles are represented using d = 100 topics generated using Latent Dirichlet Allocation [4], and w* was derived from a preliminary version of our user study. Our simulated user behaves according to the user model described in Section 3. We use probabilistic coverage (2) as the submodular basis functions.

Figure 2: Simulation results comparing LSBGreedy (red), RankLinUCB (black thick), Multiplicative Weighting (black thin), and ε-Greedy (dashed thin). The middle column computes regret versus the clairvoyant greedy solution, and not (1 − 1/e)OPT. Unless specified, results are for L = 5.

Competing Methods.
We compared LSBGreedy against the following online learning algorithms. Note that all learning algorithms use the same underlying submodular utility model.

• Multiplicative Weighting (MW) as proposed in [9], which does not employ exploration.
• RankLinUCB, which combines the LinUCB algorithm [8, 20, 15, 7, 1] with Ranked Bandits [18, 22]. RankLinUCB is similar to LSBGreedy except that it maintains a separate weight vector per slot, since it employs a reduction to L separate linear bandits (one per slot). In a sense, this is the natural application of existing approaches to our setting.⁴
• ε-Greedy, which randomly explores with probability ε, and exploits otherwise [15].

Results. Figure 2 shows a representative sample of our simulation results.⁵ We see that both ε-Greedy and Multiplicative Weighting achieve significantly worse results than LSBGreedy. We also observe that the performance of Multiplicative Weighting diverges on the synthetic dataset, which is due to the fact that it does not employ exploration. RankLinUCB is more competitive, and achieves matching performance on the synthetic dataset. We also see that RankLinUCB is more sensitive to the choice of α. Interestingly, both LSBGreedy and RankLinUCB approach the same performance when recommending L = 10 articles. This can be explained by the user's interests being saturated by 10 articles, and suggests that the bound in Theorem 1 could potentially be further refined. Additional details can be found in Appendix B in the supplementary material.

5.2 User Studies

Design. The design of our study is similar to the personalization study conducted in [9]. We presented each user with ten articles per day over ten days, from January 18, 2009 to January 27, 2009. Each day, the articles are selected using an interleaving of two policies (described below). The articles are displayed as a title with its contents viewable via a preview pane.
⁴ One can show that RankLinUCB achieves greedy regret (6) that grows as O(dL√T) (ignoring log factors), which is a factor √L worse than the regret guarantee of LSBGreedy.
⁵ For all methods, we find performance to be relatively stable w.r.t. the tuning parameters (e.g., α_t for LSBGreedy). Unless specified, we set all parameters to values that achieve good results for their respective algorithms. In particular, we set α_t = 1 for LSBGreedy, α_t = 0.6 for RankLinUCB, the update parameter of MW to 0.9, and ε = 0.1 for ε-Greedy. LSBGreedy, RankLinUCB, and ε-Greedy train linear models with regularization parameter λ, which we kept constant at λ = 1.

Figure 3: Displaying normalized learned preferences of LSBGreedy (dark) and MW (light) for two user study sessions. In the left session, MW overfits to the "world" topic. In the right session, the user likes very few articles, and MW does not discover any topics that interest the user.

    COMPARISON                      #SESSIONS   WIN/TIE/LOSE   GAIN PER DAY   % OF LIKES
    LSBGreedy vs Static Baseline    24          24 / 0 / 0     1.07           63% (67%)
    LSBGreedy vs Mult. Weighting    26          24 / 1 / 1     0.54           57% (63%)
    LSBGreedy vs RankLinUCB         27          21 / 2 / 4     0.58           57% (61%)

Table 1: User study comparing LSBGreedy with competing algorithms. The parenthetical values in the last column are computed ignoring clicks on articles jointly recommended by both algorithms (see Section 5.2). All results are statistically significant with 95% confidence.

The user is instructed to briefly skim each article to get a sense of its content and, one by one, mark each article as "interested in reading in detail" (like) or "not interested" (dislike). As in [9], for each decision, the user is told to take into account the articles shown above in the current day, so as to capture the notion of incremental coverage.
For example, a user might be interested in reading an article regarding the Middle East appearing at the top slot, and would mark it as "interested." However, if several very similar articles appear below it, the user may mark the subsequent articles as "not interested."

Evaluation. For each day, we generate an interleaving of recommendations from two algorithms. Interleaving allows us to make paired comparisons such that we simultaneously control for the particular user and particular day (certain days may contain more or less interesting content to the user than other days). Like other interleaving approaches [19], our approach maintains a notion of fairness so that both competing algorithms recommend the same amount of content. After each day, the user's feedback is collected and given to the two competing algorithms. Additional details of our experimental setup can be found in Appendix C in the supplementary material.

Data. In order to distinguish the gains of the algorithms from other effects (such as imperfections in the features, or having too high a dimension to converge), we performed dimensionality reduction. We created 18 genres (examples shown in Figure 3), labeled relevant articles, and trained a model via linear regression for each genre. Note that many articles are relevant to multiple genres.

We compared LSBGreedy against the static baseline (i.e., no personalization), Multiplicative Weighting (MW) from [9], and RankLinUCB. We evaluated each comparison setting using approximately twenty-five participants, most of whom are graduate students or young professionals.

Results. Table 1 describes our results. We first aggregated per user, and then aggregated over all users.
For each user, we computed three statistics: (1) whether LSBGreedy won, tied, or lost in terms of the total number of liked articles, (2) the difference in liked articles per day, and (3) the fraction of liked articles recommended by LSBGreedy. Jointly recommended articles can either be counted as half to each algorithm or ignored (these results are shown in parentheticals in Table 1).

Overall, about 90% of users preferred recommendations by LSBGreedy over the competing algorithms. On average, LSBGreedy obtains about one additional liked article per day and 63% of all liked articles versus the static baseline, and about half an additional liked article per day and 57% of all liked articles versus the two competing learning algorithms. The gains we observe are all statistically significant with 95% confidence, and show that LSBGreedy can be effective even when the assumptions in our theoretical analysis may not be satisfied.

Figure 3 shows the learned preferences of LSBGreedy and MW on two sessions. Since MW does not employ exploration, it can either overfit to its previous experience and not find new topics that interest the user (left plot), or fail to discover any good topics at all (right plot). We do not include a comparison with RankLinUCB since it learns L preference vectors, which are difficult to visualize.

6 Related Work

Diversified Retrieval. We are chiefly interested in training flexible submodular utility models, since such models yield practical algorithmic approaches. At one extreme are feature-free models that do not require training. However, such models are limited to unpersonalized settings that ignore context, such as recommending a global set of blogs to monitor [14]. On the other hand, methods that use feature-rich models typically either employ unsupervised training [24] or require fine-grained subtopic labels [25]. Such learning approaches cannot easily adapt to new domains.
One exception is [9], whose proposed online learning approach does not incorporate exploration. As shown in our experiments, this significantly inhibits the learning ability of their approach.

Beyond submodular models of information coverage, other approaches include methods that balance relevance and novelty [5, 26, 6] and graph-based methods [27]. For such models, it remains a challenge to design provably efficient online learning algorithms.

Bandit Learning. From the perspective of our work, existing bandit approaches can be categorized along two dimensions: single-prediction versus set-prediction, and feature-based versus feature-free. Most feature-based settings are designed to predict single results, rather than sets of results. Of such settings, the most relevant to ours is the linear stochastic bandits setting [8, 20, 15, 7, 1], which we build upon in our approach. One limitation here is the assumption of realizability – that the "true" user model lies within our class. It may be possible to develop more robust algorithms for our submodular bandits setting by building upon algorithms with more general guarantees (e.g., [2]).

Most set-based settings, such as bandit submodular optimization or the general bandit slate problem, assume a feature-free model [18, 22, 23, 12]. As such, performance is quantified relative to a fixed set of articles, which is not appropriate for many retrieval settings (e.g., news recommendation). One exception is [21], which assumes that document and user models lie within a metric space. However, it is unclear how to incorporate our submodular features into their setting.

7 Discussion of Limitations and Future Work

Submodular Basis Features. Our approach requires access to submodular basis functions as features. In practice, these basis features are often derived using various topic modeling or dimensionality reduction techniques.
However, the resulting features are almost always noisy or biased. Furthermore, one expects that different users will be better modeled using different basis features. As such, one important direction for future work is to learn the appropriate basis features from user feedback, which is similar to the setting of interactive topic modeling [11].

Moreover, user behavior is likely to be influenced by many factors beyond those well modeled by submodular basis features. For example, the probability of the user liking a certain article could be influenced by the time of day, or the day of the week. A more unified approach would be to incorporate both these standard features as well as submodular basis features in a joint model.

Curse of Dimensionality. The convergence rate of LSBGreedy depends linearly on the number of features d (which appears unavoidable without further assumptions). Thus, our approach may not be practical for settings that use a very large number of features. One possible extension is to jointly learn from multiple users simultaneously. If users tend to have similar preferences, then learning jointly from multiple users may yield convergence rates that are sub-linear in d.

8 Conclusion

We proposed an online learning setting for optimizing a general class of submodular functions. This setting is well-suited for modeling diversified retrieval systems that interactively learn from user feedback. We presented an algorithm, LSBGreedy, and proved that it efficiently converges to a near-optimal model. We conducted simulations as well as user studies in the setting of news recommendation, and found that LSBGreedy outperforms competing online learning approaches.

Acknowledgements. This work was funded in part by ONR (PECASE) N000141010672 and ONR Young Investigator Program N00014-08-1-0752.
The authors also thank Khalid El-Arini, Joey Gonzalez, Sue Ann Hong, Jing Xiang, and the anonymous reviewers for their helpful comments.

References

[1] Y. Abbasi-Yadkori, D. Pal, and C. Szepesvari. Online least squares estimation with self-normalized processes: An application to bandit problems, 2011. http://arxiv.org/abs/1102.2670.
[2] J. Abernathy, E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Conference on Learning Theory (COLT), 2008.
[3] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In ACM Conference on Web Search and Data Mining (WSDM), 2009.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR), 3:993–1022, 2003.
[5] J. Carbonell and J. Goldstein. The use of MMR, diversity-based re-ranking for reordering documents and producing summaries. In ACM Conference on Information Retrieval (SIGIR), 1998.
[6] H. Chen and D. Karger. Less is more. In ACM Conference on Information Retrieval (SIGIR), 2006.
[7] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[8] V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory (COLT), 2008.
[9] K. El-Arini, G. Veda, D. Shahaf, and C. Guestrin. Turning down the noise in the blogosphere. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2009.
[10] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM (JACM), 45(4):634–652, 1998.
[11] Y. Hu, J. Boyd-Graber, and B. Satinoff. Interactive topic modeling. In Annual Meeting of the Association for Computational Linguistics (ACL), 2011.
[12] S. Kale, L. Reyzin, and R. Schapire. Non-stochastic bandit slate problems.
In Neural Information Processing Systems (NIPS), 2010.
[13] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Neural Information Processing Systems (NIPS), 2007.
[14] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2007.
[15] L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article recommendation. In World Wide Web Conference (WWW), 2010.
[16] L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan. SCENE: A scalable two-stage personalized news recommendation system. In ACM Conference on Information Retrieval (SIGIR), 2011.
[17] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14:265–294, 1978.
[18] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In International Conference on Machine Learning (ICML), 2008.
[19] F. Radlinski, M. Kurup, and T. Joachims. How does clickthrough data reflect retrieval quality? In ACM Conference on Information and Knowledge Management (CIKM), 2008.
[20] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.
[21] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In International Conference on Machine Learning (ICML), 2010.
[22] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In Neural Information Processing Systems (NIPS), 2008.
[23] M. Streeter, D. Golovin, and A. Krause. Online learning of assignments. In Neural Information Processing Systems (NIPS), 2009.
[24] A. Swaminathan, C. Mathew, and D. Kirovski.
Essential pages. In The IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2009.
[25] Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In International Conference on Machine Learning (ICML), 2008.
[26] C. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In ACM Conference on Information Retrieval (SIGIR), 2003.
[27] X. Zhu, A. Goldberg, J. V. Gael, and D. Andrzejewski. Improving diversity in ranking using absorbing random walks. In NAACL Conference on Human Language Technologies (HLT), 2007.