{"title": "Online Learning of Assignments", "book": "Advances in Neural Information Processing Systems", "page_first": 1794, "page_last": 1802, "abstract": "Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize value of information? These applications exhibit strong diminishing returns: Selection of redundant ads and information sources decreases their marginal utility. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm is equipped with strong theoretical guarantees, with a performance ratio that converges to the optimal constant of 1-1/e. We empirically evaluate our algorithms on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades.", "full_text": "Online Learning of Assignments\n\nMatthew Streeter\n\nGoogle, Inc.\n\nPittsburgh, PA 15213\n\nDaniel Golovin\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAndreas Krause\n\nCalifornia Institute of Technology\n\nPasadena, CA 91125\n\nmstreeter@google.com\n\ndgolovin@cs.cmu.edu\n\nkrausea@caltech.edu\n\nAbstract\n\nWhich ads should we display in sponsored search in order to maximize our revenue?\nHow should we dynamically rank information sources to maximize the value of\nthe ranking? These applications exhibit strong diminishing returns: Redundancy\ndecreases the marginal utility of each ad or information source. We show that\nthese and other problems can be formalized as repeatedly selecting an assignment\nof items to positions to maximize a sequence of monotone submodular functions\nthat arrive one by one. 
We present an ef\ufb01cient algorithm for this general problem\nand analyze it in the no-regret model. Our algorithm possesses strong theoretical\nguarantees, such as a performance ratio that converges to the optimal constant\nof 1 \u2212 1/e. We empirically evaluate our algorithm on two real-world online\noptimization problems on the web: ad allocation with submodular utilities, and\ndynamically ranking blogs to detect information cascades.\n\n1 Introduction\nConsider the problem of repeatedly choosing advertisements to display in sponsored search to\nmaximize our revenue. In this problem, there is a small set of positions on the page, and each time\na query arrives we would like to assign, to each position, one out of a large number of possible\nads. In this and related problems that we call online assignment learning problems, there is a set\nof positions, a set of items, and a sequence of rounds, and on each round we must assign an item\nto each position. After each round, we obtain some reward depending on the selected assignment,\nand we observe the value of the reward. When there is only one position, this problem becomes\nthe well-studied multiarmed bandit problem [2]. When the positions have a linear ordering the\nassignment can be construed as a ranked list of elements, and the problem becomes one of selecting\nlists online. Online assignment learning thus models a central challenge in web search, sponsored\nsearch, news aggregators, and recommendation systems, among other applications.\nA common assumption made in previous work on these problems is that the quality of an assignment\nis the sum of a function on the (item, position) pairs in the assignment. For example, online advertising\nmodels with click-through-rates [6] make an assumption of this form. More recently, there have been\nattempts to incorporate the value of diversity in the reward function [16]. 
Intuitively, even though the best K results for the query \u201cturkey\u201d might happen to be about the country, the best list of K results is likely to contain some recipes for the bird as well. This will be the case if there are diminishing returns on the number of relevant links presented to a user; for example, if it is better to present each user with at least one relevant result than to present half of the users with no relevant results and half with two relevant results. We incorporate these considerations in a flexible way by providing an algorithm that performs well whenever the reward for an assignment is a monotone submodular function of its set of (item, position) pairs.\nOur key contributions are: (i) an efficient algorithm, TABULARGREEDY, that provides a (1 \u2212 1/e) approximation ratio for the problem of optimizing assignments under submodular utility functions, (ii) an algorithm for online learning of assignments, TGBANDIT, that has strong performance guarantees in the no-regret model, and (iii) an empirical evaluation on two problems of information gathering on the web.\n\n\f2 The assignment learning problem\nWe consider problems where we have K positions (e.g., slots for displaying ads) and need to assign an item (e.g., an ad) to each position in order to maximize a utility function (e.g., the revenue from clicks on the ads). We address both the offline problem, where the utility function is specified in advance, and the online problem, where a sequence of utility functions arrives over time and we need to repeatedly select a new assignment.\n\nThe Offline Problem. In the offline problem we are given sets P1, P2, . . . , PK, where Pk is the set of items that may be placed in position k. We assume without loss of generality that these sets are disjoint.1 An assignment is a subset S \u2286 V, where V = P1 \u222a P2 \u222a \u00b7\u00b7\u00b7 \u222a PK is the set of all items. 
We call an assignment feasible if at most one item is assigned to each position (i.e., for all k, |S \u2229 Pk| \u2264 1). We use P to refer to the set of feasible assignments.\nOur goal is to find a feasible assignment maximizing a utility function f : 2^V \u2192 R\u22650. As we discuss later, many important assignment problems satisfy submodularity, a natural diminishing returns property: Assigning a new item to a position k increases the utility more if few elements have been assigned yet, and less if many items have already been assigned. Formally, a utility function f is called submodular if for all S \u2286 S' and s /\u2208 S' it holds that f(S \u222a {s}) \u2212 f(S) \u2265 f(S' \u222a {s}) \u2212 f(S'). We will also assume f is monotone (i.e., for all S \u2286 S', we have f(S) \u2264 f(S')). Our goal is thus, for a given non-negative, monotone and submodular utility function f, to find a feasible assignment S\u2217 of maximum utility, S\u2217 = arg max_{S\u2208P} f(S).\nThis optimization problem is NP-hard. In fact, a stronger negative result holds:\nTheorem 1 ([14]). For any \u03b5 > 0, any algorithm guaranteed to obtain a solution within a factor of (1 \u2212 1/e + \u03b5) of max_{S\u2208P} f(S) requires exponentially many evaluations of f in the worst case.\n\nIn light of this negative result, we can only hope to efficiently obtain a solution that achieves a fraction of (1 \u2212 1/e) of the optimal value. In \u00a73.2 we develop such an algorithm.\nThe Online Problem. The offline problem is inappropriate for modeling dynamic settings, where the utility function may change over time and we need to repeatedly select new assignments, trading off exploration (experimenting with ad display to gain information about the utility function) and exploitation (displaying ads which we believe will maximize utility). 
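The offline objects just defined (feasible assignments, monotone submodular utilities) are easy to sanity-check in code. The sketch below is illustrative only: the coverage function f, the item names, and the partition layout are invented stand-ins (a coverage function is a standard example of a monotone submodular utility), and `is_submodular` brute-forces the diminishing-returns inequality on a tiny ground set.

```python
import itertools

# Toy instance, purely illustrative: items are (ad, position) pairs and
# f is a coverage function (f(S) = number of users covered), a standard
# example of a monotone submodular utility.
COVERS = {
    ("a1", 1): {1, 2},
    ("a2", 1): {2, 3},
    ("a1", 2): {1},
    ("a2", 2): {3, 4},
}
V = list(COVERS)
P = {1: [("a1", 1), ("a2", 1)], 2: [("a1", 2), ("a2", 2)]}

def f(S):
    covered = set()
    for item in S:
        covered |= COVERS[item]
    return len(covered)

def is_feasible(S):
    # Feasible: at most one item assigned to each position.
    return all(sum(1 for x in S if x in P[k]) <= 1 for k in P)

def is_submodular(f, ground):
    # Brute-force check of f(S + s) - f(S) >= f(S' + s) - f(S')
    # for all S subset-of S' and s outside S'.
    all_sets = [frozenset(c) for r in range(len(ground) + 1)
                for c in itertools.combinations(ground, r)]
    for S in all_sets:
        for Sp in all_sets:
            if not S <= Sp:
                continue
            for s in ground:
                if s in Sp:
                    continue
                if f(S | {s}) - f(S) < f(Sp | {s}) - f(Sp):
                    return False
    return True

ok = is_submodular(f, V)
```

The brute-force check is exponential in |V| and only meant to make the definition concrete on a toy example.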
More formally, we face a sequential decision problem where, on each round (which, e.g., corresponds to a user query for a particular term), we want to select an assignment St (ads to display). We assume that the sets P1, P2, . . . , PK are fixed in advance for all rounds. After we select the assignment we obtain reward ft(St) for some non-negative monotone submodular utility function ft. We call the setting where we do not get any information about ft beyond the reward the bandit feedback model. In contrast, in the full-information feedback model we obtain oracle access to ft (i.e., we can evaluate ft on arbitrary feasible assignments). Both models arise in real applications, as we show in \u00a75.\nThe goal is to maximize the total reward we obtain, namely \u2211_t ft(St). Following the multiarmed bandit literature, we evaluate our performance after T rounds by comparing our total reward against that obtained by a clairvoyant algorithm with knowledge of the sequence of functions \u27e8f1, . . . , fT\u27e9, but with the restriction that it must select the same assignment on each round. The difference between the clairvoyant algorithm\u2019s total reward and ours is called our regret. The goal is then to develop an algorithm whose expected regret grows sublinearly in the number of rounds; such an algorithm is said to have (or be) no-regret. However, since sums of submodular functions remain submodular, the clairvoyant algorithm has to solve an offline assignment problem with f(S) = \u2211_t ft(S). Considering Theorem 1, no polynomial-time algorithm can possibly hope to achieve a no-regret guarantee. To accommodate this fact, we discount the reward of the clairvoyant algorithm by a factor of (1 \u2212 1/e): We define the (1 \u2212 1/e)-regret of a random sequence \u27e8S1, . . . , ST\u27e9 as\n\n(1 \u2212 1/e) \u00b7 max_{S\u2208P} \u2211_{t=1}^T ft(S) \u2212 E[\u2211_{t=1}^T ft(St)].\n\nOur goal is then to develop efficient algorithms whose (1 \u2212 1/e)-regret grows sublinearly in T.\n\n1If the same item can be placed in multiple positions, simply create multiple distinct copies of it.\n\n\fSubsumed Models. Our model generalizes several common models for sponsored search ad selection and web search results. These include models with click-through-rates, in which it is assumed that each (ad, position) pair has some probability p(a, k) of being clicked on, and there is some monetary reward b(a) that is obtained whenever ad a is clicked on. Often, the click-through-rates are assumed to be separable, meaning p(a, k) has the functional form \u03b1(a) \u00b7 \u03b2(k) for some functions \u03b1 and \u03b2. See [7, 12] for more details on sponsored search ad allocation. Note that in both of these cases, the (expected) reward of a set S of (ad, position) pairs is \u2211_{(a,k)\u2208S} g(a, k) for some nonnegative function g. It is easy to verify that such a reward function is monotone submodular. Thus, we can capture this model in our framework by setting Pk = A \u00d7 {k}, where A is the set of ads. Another subsumed model, for web search, appears in [16]; it assumes that each user is interested in a particular set of results, and any list of results that intersects this set generates a unit of value; all other lists generate no value, and the ordering of results is irrelevant. Again, the reward function is monotone submodular. 
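As a concrete reading of the (1 - 1/e)-regret definition, the sketch below brute-forces the static optimum over feasible assignments on a toy two-round instance and evaluates the discounted comparator. All names and per-round coverage data are invented for illustration.

```python
import itertools, math

# Hypothetical toy data: per-round utilities are coverage functions over
# users; positions 1..2, items are (ad, position) pairs as in the text.
P = {1: [("a", 1), ("b", 1)], 2: [("a", 2), ("b", 2)]}
ROUND_COVERS = [  # one dict per round t: item -> users it reaches
    {("a", 1): {1, 2}, ("b", 1): {3}, ("a", 2): {1}, ("b", 2): {4}},
    {("a", 1): {1}, ("b", 1): {2, 3}, ("a", 2): {4}, ("b", 2): {2}},
]

def f_t(t, S):
    covered = set()
    for x in S:
        covered |= ROUND_COVERS[t].get(x, set())
    return len(covered)

def feasible_assignments():
    # Here: exactly one item per position (the text allows "at most one").
    for combo in itertools.product(*P.values()):
        yield frozenset(combo)

def one_minus_inv_e_regret(played):
    # (1 - 1/e) * max_S sum_t f_t(S)  minus the reward actually collected.
    T = len(played)
    opt = max(sum(f_t(t, S) for t in range(T))
              for S in feasible_assignments())
    got = sum(f_t(t, played[t]) for t in range(T))
    return (1 - 1 / math.e) * opt - got

played = [frozenset({("a", 1), ("b", 2)}),
          frozenset({("a", 1), ("b", 2)})]
regret = one_minus_inv_e_regret(played)
```

On this tiny instance the played sequence happens to beat the discounted comparator, so the (1 - 1/e)-regret is negative; sublinear growth of this quantity in T is the guarantee the paper targets.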
In this setting, it is desirable to display a diverse set of results in order to maximize the likelihood that at least one of them will interest the user.\nOur model is flexible in that we can handle position-dependent effects and diversity considerations simultaneously. For example, we can handle the case that each user u is interested in a particular set Au of ads and looks at a set Iu of positions, and the reward of an assignment S is any monotone-increasing concave function g of |S \u2229 (Au \u00d7 Iu)|. If Iu = {1, 2, . . . , k} and g(x) = x, this models the case where the quality is the number of relevant results that appear in the first k positions. If Iu equals all positions and g(x) = min{x, 1} we recover the model of [16].\n3 An approximation algorithm for the offline problem\n3.1 The locally greedy algorithm\nA simple approach to the assignment problem is the following greedy procedure: the algorithm steps through all K positions (according to some fixed, arbitrary ordering). For position k, it simply chooses the item that increases the total value as much as possible, i.e., it chooses\n\nsk = arg max_{s\u2208Pk} f({s1, . . . , sk\u22121} + s),\n\nwhere, for a set S and element e, we write S + e for S \u222a {e}. Perhaps surprisingly, no matter which ordering over the positions is chosen, this so-called locally greedy algorithm produces an assignment that obtains at least half the optimal value [8]. In fact, the following more general result holds. We will use this lemma in the analysis of our improved offline algorithm, which uses the locally greedy algorithm as a subroutine.\n\nLemma 2. Suppose f : 2^V \u2192 R\u22650 is of the form f(S) = f0(S) + \u2211_{k=1}^K fk(S \u2229 Pk) where f0 : 2^V \u2192 R\u22650 is monotone submodular, and fk : 2^{Pk} \u2192 R\u22650 is arbitrary for k \u2265 1. Let L be the solution returned by the locally greedy algorithm. 
Then f(L) + f0(L) \u2265 max_{S\u2208P} f(S).\nThe proof is given in an extended version of this paper [9]. Observe that in the special case where fk \u2261 0 for all k \u2265 1, Lemma 2 says that f(L) \u2265 (1/2) max_{S\u2208P} f(S). In [9] we provide a simple example showing that this 1/2 approximation ratio is tight.\n\n3.2 An algorithm with optimal approximation ratio\nWe now present an algorithm that achieves the optimal approximation ratio of 1 \u2212 1/e, improving on the 1/2 approximation for the locally greedy algorithm. Our algorithm associates with each partition Pk a color ck from a palette [C] of C colors, where we use the notation [n] = {1, 2, . . . , n}. For any set S \u2286 V \u00d7 [C] and vector c\u20d7 = (c1, . . . , cK), define sample_{c\u20d7}(S) = \u222a_{k=1}^K {x \u2208 Pk : (x, ck) \u2208 S}. Given a set S of (item, color) pairs, which we may think of as labeling each item with one or more colors, sample_{c\u20d7}(S) returns a set containing each item x that is labeled with whatever color c\u20d7 assigns to the partition that contains x. Let F (S) denote the expected value of f(sample_{c\u20d7}(S)) when each color ck is selected uniformly at random from [C]. Our TABULARGREEDY algorithm greedily optimizes F , as shown in the following pseudocode.\n\n\fAlgorithm: TABULARGREEDY\nInput: integer C, sets P1, P2, . . . , PK, function f : 2^V \u2192 R\u22650 (where V = \u222a_{k=1}^K Pk)\n\nset G := \u2205\nfor c from 1 to C do /* for each color */\n    for k from 1 to K do /* for each partition */\n        set gk,c := arg max_{x\u2208Pk\u00d7{c}} F (G + x) /* greedily pick gk,c */\n        set G := G + gk,c\nfor each k \u2208 [K], choose ck uniformly at random from [C]\nreturn sample_{c\u20d7}(G), where c\u20d7 := (c1, . . . , cK)\n\nObserve that when C = 1, there is only one possible choice for c\u20d7, and TABULARGREEDY is simply the locally greedy algorithm from \u00a73.1. In the limit as C \u2192 \u221e, TABULARGREEDY can intuitively be viewed as an algorithm for a continuous extension of the problem followed by a rounding procedure, in the same spirit as Vondr\u00e1k\u2019s continuous-greedy algorithm [4]. In our case, the continuous extension is to compute a probability distribution Dk for each position k with support in Pk (plus a special \u201cselect nothing\u201d outcome), such that if we independently sample an element xk from Dk, E[f({x1, . . . , xK})] is maximized. It turns out that if the positions individually, greedily, and in round-robin fashion, add infinitesimal units of probability mass to their distributions so as to maximize this objective function, they achieve the same objective function value as if, rather than making decisions in a round-robin fashion, they had cooperated and added the combination of K infinitesimal probability mass units (one per position) that greedily maximizes the objective function. The latter process, in turn, can be shown to be equivalent to a greedy algorithm for maximizing a (different) submodular function subject to a cardinality constraint, which implies that it achieves a 1 \u2212 1/e approximation ratio [15]. TABULARGREEDY represents a tradeoff between these two extremes; its performance is summarized by the following theorem. For now, we assume that the arg max in the inner loop is computed exactly. In the extended version [9], we bound the performance loss that results from approximating the arg max (e.g., by estimating F by repeated sampling).\nTheorem 3. Suppose f is monotone submodular. 
Then F (G) \u2265 \u03b2(K, C) \u00b7 max_{S\u2208P} f(S), where \u03b2(K, C) is defined as 1 \u2212 (1 \u2212 1/C)^C \u2212 (K choose 2)/C.\nIt follows that, for any \u03b5 > 0, TABULARGREEDY achieves a (1 \u2212 1/e \u2212 \u03b5) approximation factor using a number of colors that is polynomial in K and 1/\u03b5. The theorem will follow immediately from the combination of two key lemmas, which we now prove. Informally, Lemma 4 analyzes the approximation error due to the outer greedy loop of the algorithm, while Lemma 5 analyzes the approximation error due to the inner loop.\nLemma 4. Let Gc = {g1,c, g2,c, . . . , gK,c}, and let G^\u2212_c = G1 \u222a G2 \u222a . . . \u222a Gc\u22121. For each color c, choose Ec \u2208 R such that F (G^\u2212_c \u222a Gc) \u2265 max_{R\u2208Rc} F (G^\u2212_c \u222a R) \u2212 Ec, where Rc := {R : \u2200k \u2208 [K], |R \u2229 (Pk \u00d7 {c})| = 1} is the set of all possible choices for Gc. Then\n\nF (G) \u2265 \u03b2(C) \u00b7 max_{S\u2208P} f(S) \u2212 \u2211_{c=1}^C Ec,     (3.1)\n\nwhere \u03b2(C) = 1 \u2212 (1 \u2212 1/C)^C.\n\nProof (Sketch). We will refer to an element R of Rc as a row, and to c as the color of the row. Let R[C] := \u222a_{c=1}^C Rc be the set of all rows. Consider the function H : 2^{R[C]} \u2192 R\u22650, defined as H(R) = F (\u222a_{R\u2208R} R). We will prove the lemma in three steps: (i) H is monotone submodular, (ii) TABULARGREEDY is simply the locally greedy algorithm for finding a set of C rows that maximizes H, where the cth greedy step is performed with additive error Ec, and (iii) TABULARGREEDY obtains the guarantee (3.1) for maximizing H, and this implies the same ratio for maximizing F .\nTo show that H is monotone submodular, it suffices to show that F is monotone submodular. Because F (S) = E_{c\u20d7}[f(sample_{c\u20d7}(S))], and because a convex combination of monotone submodular functions is monotone submodular, it suffices to show that for any particular coloring c\u20d7, the function f(sample_{c\u20d7}(S)) is monotone submodular. This follows from the definition of sample and the fact that f is monotone submodular.\nThe second claim is true by inspection. To prove the third claim, we note that the row colors for a set of rows R can be interchanged with no effect on H(R). For problems with this special property, it is known that the locally greedy algorithm obtains an approximation ratio of \u03b2(C) = 1 \u2212 (1 \u2212 1/C)^C [15]. Theorem 6 of [17] extends this result to handle additive error, and yields\n\nF (G) = H({G1, G2, . . . , GC}) \u2265 \u03b2(C) \u00b7 max_{R\u2286R[C]:|R|\u2264C} H(R) \u2212 \u2211_{c=1}^C Ec.\n\nTo complete the proof, it suffices to show that max_{R\u2286R[C]:|R|\u2264C} H(R) \u2265 max_{S\u2208P} f(S). This follows from the fact that for any assignment S \u2208 P, we can find a set R(S) of C rows such that sample_{c\u20d7}(\u222a_{R\u2208R(S)} R) = S with probability 1, and therefore H(R(S)) = f(S).\n\nWe now bound the performance of the inner loop of TABULARGREEDY.\nLemma 5. Let f\u2217 = max_{S\u2208P} f(S), and let Gc, G^\u2212_c, and Rc be defined as in the statement of Lemma 4. Then, for any c \u2208 [C], F (G^\u2212_c \u222a Gc) \u2265 max_{R\u2208Rc} F (G^\u2212_c \u222a R) \u2212 (K choose 2) C^\u22122 f\u2217.\nProof (Sketch). Let N denote the number of partitions whose color (assigned by c\u20d7) is c. For R \u2208 Rc, let \u0394_{c\u20d7}(R) := f(sample_{c\u20d7}(G^\u2212_c \u222a R)) \u2212 f(sample_{c\u20d7}(G^\u2212_c)), and let Fc(R) := F (G^\u2212_c \u222a R) \u2212 F (G^\u2212_c). By definition, Fc(R) = E_{c\u20d7}[\u0394_{c\u20d7}(R)] = P[N = 1] E_{c\u20d7}[\u0394_{c\u20d7}(R) | N = 1] + P[N \u2265 2] E_{c\u20d7}[\u0394_{c\u20d7}(R) | N \u2265 2], where we have used the fact that \u0394_{c\u20d7}(R) = 0 when N = 0. The idea of the proof is that the first of these terms dominates as C \u2192 \u221e, and that E_{c\u20d7}[\u0394_{c\u20d7}(R) | N = 1] can be optimized exactly simply by optimizing each element of Pk \u00d7 {c} independently. Specifically, it can be seen that E_{c\u20d7}[\u0394_{c\u20d7}(R) | N = 1] = \u2211_{k=1}^K fk(R \u2229 (Pk \u00d7 {c})) for suitable fk. Additionally, f0(R) = P[N \u2265 2] E_{c\u20d7}[\u0394_{c\u20d7}(R) | N \u2265 2] is a monotone submodular function of a set of (item, color) pairs, for the same reasons F is. Applying Lemma 2 with these {fk : k \u2265 0} yields Fc(Gc) + P[N \u2265 2] E_{c\u20d7}[\u0394_{c\u20d7}(Gc) | N \u2265 2] \u2265 max_{R\u2208Rc} Fc(R). 
To complete the proof, it suffices to show P[N \u2265 2] \u2264 (K choose 2) C^\u22122 and E_{c\u20d7}[\u0394_{c\u20d7}(Gc) | N \u2265 2] \u2264 f\u2217. The first inequality holds because, if we let M be the number of pairs of partitions that are both assigned color c, we have P[N \u2265 2] = P[M \u2265 1] \u2264 E[M] = (K choose 2) C^\u22122. The second inequality follows from the fact that for any c\u20d7 we have \u0394_{c\u20d7}(Gc) \u2264 f(sample_{c\u20d7}(G^\u2212_c \u222a Gc)) \u2264 f\u2217.\n\n4 An algorithm for online learning of assignments\nWe now transform the offline algorithm of \u00a73.2 into an online algorithm. The high-level idea behind this transformation is to replace each greedy decision made by the offline algorithm with a no-regret online algorithm. A similar approach was used in [16] and [18] to obtain an online algorithm for different (simpler) online problems.\n\nAlgorithm: TGBANDIT (described in the full-information feedback model)\nInput: integer C, sets P1, P2, . . . , PK\n\nfor each k \u2208 [K] and c \u2208 [C], let Ek,c be a no-regret algorithm with action set Pk \u00d7 {c}.\nfor t from 1 to T do\n    for each k \u2208 [K] and c \u2208 [C], let g^t_{k,c} \u2208 Pk \u00d7 {c} be the action selected by Ek,c\n    for each k \u2208 [K], choose ck uniformly at random from [C]. Define c\u20d7 = (c1, . . . , cK).\n    select the set Gt = sample_{c\u20d7}({g^t_{k,c} : k \u2208 [K], c \u2208 [C]})\n    observe ft, and let \u00afFt(S) := ft(sample_{c\u20d7}(S))\n    for each k \u2208 [K], c \u2208 [C] do\n        define G^{t\u2212}_{k,c} \u2261 {g^t_{k',c'} : k' \u2208 [K], c' < c} \u222a {g^t_{k',c} : k' < k}\n        for each x \u2208 Pk \u00d7 {c}, feed back \u00afFt(G^{t\u2212}_{k,c} + x) to Ek,c as the reward for choosing x\n\nThe following theorem summarizes the performance of TGBANDIT.\nTheorem 6. Let rk,c be the regret of Ek,c, and let \u03b2(K, C) = 1 \u2212 (1 \u2212 1/C)^C \u2212 (K choose 2)/C. Then\n\nE[\u2211_{t=1}^T ft(Gt)] \u2265 \u03b2(K, C) \u00b7 max_{S\u2208P} \u2211_{t=1}^T ft(S) \u2212 E[\u2211_{k=1}^K \u2211_{c=1}^C rk,c].\n\nObserve that Theorem 6 is similar to Theorem 3, with the addition of the E[rk,c] terms. The idea of the proof is to view TGBANDIT as a version of TABULARGREEDY that, instead of greedily selecting single (element, color) pairs gk,c \u2208 Pk \u00d7 {c}, greedily selects (element vector, color) pairs g\u20d7k,c \u2208 P^T_k \u00d7 {c} (here, P^T_k is the T th power of the set Pk). We allow for the case that the greedy decision is made imperfectly, with additive error rk,c; this is the source of the extra terms. Once this correspondence is established, the theorem follows along the lines of Theorem 3. For a proof, see the extended version [9].\nCorollary 7. 
If TGBANDIT is run with randomized weighted majority [5] as the subroutine, then\n\nE[\u2211_{t=1}^T ft(Gt)] \u2265 \u03b2(K, C) \u00b7 max_{S\u2208P} \u2211_{t=1}^T ft(S) \u2212 O(C \u2211_{k=1}^K \u221a(T log |Pk|)),\n\nwhere \u03b2(K, C) = 1 \u2212 (1 \u2212 1/C)^C \u2212 (K choose 2)/C.\nOptimizing for C in Corollary 7 yields (1 \u2212 1/e)-regret \u02dc\u0398(K^{3/2} T^{1/4} \u221aOPT) ignoring logarithmic factors, where OPT := max_{S\u2208P} \u2211_{t=1}^T ft(S) is the value of the static optimum.\n\nDealing with bandit feedback. TGBANDIT can be modified to work in the bandit feedback model. The idea behind this modification is that on each round we \u201cexplore\u201d with some small probability, in such a way that on each round we obtain an unbiased estimate of the desired feedback values \u00afFt(G^{t\u2212}_{k,c} + x) for each k \u2208 [K], c \u2208 [C], and x \u2208 Pk. This technique can be used to achieve a bound similar to the one stated in Corollary 7, but with an additive regret term of O((T |V| CK)^{2/3} (log |V|)^{1/3}).\nStronger notions of regret. By substituting in different algorithms for the subroutines Ek,c, we can obtain additional guarantees. For example, Blum and Mansour [3] consider online problems in which we are given time-selection functions I1, I2, . . . , IM . Each time-selection function I : [T ] \u2192 [0, 1] associates a weight with each round, and defines a corresponding weighted notion of regret in the natural way. Blum and Mansour\u2019s algorithm guarantees low weighted regret with respect to all M time selection functions simultaneously. This can be used to obtain low regret with respect to different (possibly overlapping) windows of time simultaneously, or to obtain low regret with respect to subsets of rounds that have particular features. By using their algorithm as a subroutine within TGBANDIT, we get similar guarantees, both in the full information and bandit feedback models.\n\n5 Evaluation\nWe evaluate TGBANDIT experimentally on two applications: Learning to rank blogs that are effective in detecting cascades of information, and allocating advertisements to maximize revenue.\n\n5.1 Online learning of diverse blog rankings\nWe consider the problem of ranking a set of blogs and news sources on the web. Our approach is based on the following idea: A blogger writes a posting, and, after some time, other postings link to it, forming cascades of information propagating through the network of blogs.\nMore formally, an information cascade is a directed acyclic graph of vertices (each vertex corresponds to a posting at some blog), where edges are annotated by the time difference between the postings. Based on this notion of an information cascade, we would like to select blogs that detect big cascades (containing many nodes) as early as possible (i.e., we want to learn about an important event before most other readers). In [13] it is shown how one can formalize this notion of utility using a monotone submodular function that measures the informativeness of a subset of blogs. Optimizing the submodular function yields a small set of blogs that \u201ccovers\u201d most cascades. This utility function prefers diverse sets of blogs, minimizing the overlap of the detected cascades, and therefore minimizing redundancy.\nThe work by [13] leaves two major shortcomings: Firstly, they select a set of blogs rather than a ranking, which is of practical importance for the presentation on a web service. Secondly, they do not address the problem of sequential prediction, where the set of blogs must be updated dynamically over time. 
In this paper, we address these shortcomings.\n\n\f(a) Blogs: Offline results  (b) Blogs: Online results  (c) Ad display: Online results\nFigure 1: (a,b) Results for discounted blog ranking (\u03b3 = 0.8), in offline (a) and online (b) setting. (c) Performance of TGBANDIT with C = 1, 2, and 4 colors for the sponsored search ad selection problem (each round is a query). Note that C = 1 corresponds to the online algorithm of [16, 18].\n\nResults on offline blog ranking. In order to model the blog ranking problem, we adopt the assumption that different users have different attention spans: Each user will only consider blogs appearing in a particular subset of positions. In our experiments, we assume that the probability that a user is willing to look at position k is proportional to \u03b3^k, for some discount factor 0 < \u03b3 < 1. More formally, let g be the monotone submodular function measuring the informativeness of any set of blogs, defined as in [13]. Let Pk = B \u00d7 {k}, where B is the set of blogs. Given an assignment S \u2208 P, let S[k] = S \u2229 {P1 \u222a P2 \u222a . . . \u222a Pk} be the assignment of blogs to positions 1 through k. We define the discounted value of the assignment S as f(S) = \u2211_{k=1}^K \u03b3^k (g(S[k]) \u2212 g(S[k\u22121])). It can be seen that f : 2^V \u2192 R\u22650 is monotone submodular.\nFor our experiments, we use the data set of [13], consisting of 45,192 blogs, 16,551 cascades, and 2 million postings collected during 12 months of 2006. We use the population affected objective of [13], and use a discount factor of \u03b3 = 0.8. Based on this data, we run our TABULARGREEDY algorithm with varying numbers of colors C on the blog data set. Fig. 1(a) presents the results of this experiment. For each value of C, we generate 200 rankings, and report both the average performance and the maximum performance over the 200 trials. 
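The discounted objective has a direct transcription, and combining it with a per-position greedy fill (the C = 1 special case of TABULARGREEDY, i.e., the locally greedy algorithm of Sec. 3.1) gives a compact end-to-end sketch. Here g is a toy stand-in for the cascade-informativeness objective of [13]; the blog names and cascade sets are invented, and gamma = 0.8 matches the experiments.

```python
# Toy sketch: discounted ranking value plus a locally greedy ranking
# (the C = 1 case of TABULARGREEDY applied to this objective).
# g below is a hypothetical coverage-style informativeness function.
GAMMA = 0.8
CASCADES = {"blogA": {1, 2, 3}, "blogB": {3, 4}, "blogC": {5}, "blogD": {1, 3}}

def g(blogs):
    # number of distinct cascades "covered" by the chosen blogs
    return len(set().union(*(CASCADES[b] for b in blogs)))

def discounted_value(ranking):
    # f(S) = sum_k gamma^k * (g(S[k]) - g(S[k-1]))
    total, prev = 0.0, 0
    for k in range(1, len(ranking) + 1):
        cur = g(ranking[:k])
        total += GAMMA ** k * (cur - prev)
        prev = cur
    return total

def greedy_ranking(K):
    # Fill positions 1..K in order, each taking the blog with the
    # largest marginal discounted gain (locally greedy).
    ranking = []
    for _ in range(K):
        rest = [b for b in CASCADES if b not in ranking]
        ranking.append(max(rest,
                           key=lambda b: discounted_value(ranking + [b])))
    return ranking

ranking = greedy_ranking(K=3)
v = discounted_value(ranking)
```

Note how the discount makes order matter: moving an informative blog higher in the ranking increases f even though g itself ignores order.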
Increasing C leads to an improved performance\nover the locally greedy algorithm (C = 1).\nResults on online learning of blog rankings. We now consider the online problem where on\neach round t we want to output a ranking St. After we select the ranking, a new set of cascades\noccurs, modeled using a separate submodular function ft, and we obtain a reward of ft(St). In\nour experiments, we choose one assignment per day, and de\ufb01ne ft as the utility associated with the\ncascades occurring on that day. Note that ft allows us to evaluate the performance of any possible\nranking St, hence we can apply TGBANDIT in the full-information feedback model.\nWe compare the performance of our online algorithm using C = 1 and C = 4. Fig. 1(b) presents the\naverage cumulative reward gained over time by both algorithms. We normalize the average reward by\nthe utility achieved by the TABULARGREEDY algorithm (with C = 1) applied to the entire data set.\nFig. 1(b) shows that the performance of both algorithms rapidly (within the \ufb01rst 47 rounds) converges\nto the performance of the of\ufb02ine algorithm. The TGBANDIT algorithm with C = 4 levels out at an\napproximately 4% higher reward than the algorithm with C = 1.\n\n5.2 Online ad display\nWe evaluate TGBANDIT for the sponsored search ad selection problem in a simple Markovian model\nincorporating the value of diverse results and complex position-dependence among clicks. In this\nmodel, each user u is de\ufb01ned by two sets of probabilities: pclick(a) for each ad a \u2208 A, and pabandon(k)\nfor each position k \u2208 [K]. When presented an assignment of ads {a1, a2, . . . , aK}, where ak\noccupies position k, the user scans the positions in increasing order. For each position k, the user\nclicks on ak with probability pclick(ak), leaving the results page forever. Otherwise, with probability\n(1 \u2212 pclick(ak)) \u00b7 pabandon(k), the user loses interest and abandons the results without clicking on\nanything. 
Finally, with probability (1 − p_click(a_k)) · (1 − p_abandon(k)), the user proceeds to look at position k + 1. The reward function f_t is the number of clicks, which is either zero or one. We only receive information about f_t(S_t) (i.e., bandit feedback).

In our evaluation, there are 5 positions, 20 available ads, and two (equally frequent) types of users: type 1 users interested in all positions (p_abandon ≡ 0), and type 2 users who quickly lose interest (p_abandon ≡ 0.5). There are also two types of ads, half of type 1 and half of type 2, and users are probabilistically more interested in ads of their own type than in those of the opposite type. Specifically, for both types of users we set p_click(a) = 0.5 if a has the same type as the user, and p_click(a) = 0.2 otherwise. In Fig. 1(c) we compare the performance of TGBANDIT with C = 4 to the online algorithm of [16, 18], based on the average of 100 experiments. The latter algorithm is equivalent to running TGBANDIT with C = 1. The two perform similarly in the first 10^4 rounds; thereafter the former dominates.

It can be shown that with several different types of users with distinct p_click(·) functions, the offline problem of finding an assignment within a factor of 1 − 1/e + ε of optimal is NP-hard. This is in contrast to the case in which p_click and p_abandon are the same for all users; in that case the offline problem simply requires finding an optimal policy for a Markov decision process, which can be done efficiently using well-known algorithms. A slightly different, efficiently solvable Markov model of user behavior was considered in [1].
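Because the scan model is Markovian, the expected number of clicks for a fixed assignment can be computed in closed form by tracking the probability that the user is still scanning. The sketch below is hypothetical illustration code, not the authors' implementation; it is instantiated with the p_click and p_abandon values from the experiment described above.

```python
def expected_clicks(ads, p_click, p_abandon):
    """Expected clicks (between 0 and 1) under the Markovian scan model:
    at position k the user clicks on ad a_k with probability p_click(a_k)
    and leaves; otherwise abandons with probability
    (1 - p_click(a_k)) * p_abandon(k); otherwise scans position k + 1."""
    scanning = 1.0  # probability the user is still scanning
    clicks = 0.0
    for k, ad in enumerate(ads):
        pc = p_click(ad)
        clicks += scanning * pc
        scanning *= (1.0 - pc) * (1.0 - p_abandon(k))
    return clicks

# Parameters from the experiment: a user clicks same-type ads with
# probability 0.5 and other ads with probability 0.2 (shown here for a
# type-1 user); type-1 users never abandon, type-2 users abandon with
# probability 0.5 at every position.
p_click_type1 = lambda ad_type: 0.5 if ad_type == 1 else 0.2
p_abandon_type1 = lambda k: 0.0
p_abandon_type2 = lambda k: 0.5
```

For instance, two same-type ads yield 0.5 + 0.5 · 0.5 = 0.75 expected clicks for a user who never abandons, but only 0.5 + 0.25 · 0.5 = 0.625 for one who abandons with probability 0.5, since the latter reaches position 2 only a quarter of the time.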
In the model of [1], p_click and p_abandon are the same for all users, and p_abandon is a function of the ad in the slot currently being scanned rather than of its index.

6 Related Work
For a general introduction to the literature on submodular function maximization, see [19]. For applications of submodularity to machine learning and AI, see [11].
Our offline problem is known in the operations research and theoretical computer science communities as maximizing a monotone submodular function subject to a (simple) partition matroid constraint. The study of this problem culminated in the elegant (1 − 1/e)-approximation algorithm of Vondrák [20] and a matching unconditional lower bound of Mirrokni et al. [14]. Vondrák's algorithm, called the continuous-greedy algorithm, has also been extended to handle arbitrary matroid constraints [4]. The continuous-greedy algorithm, however, cannot be applied to our problem directly, because it requires the ability to sample f(·) on infeasible sets S ∉ P. In our context, this means it must be able to ask (for example) what the revenue would be if ads a_1 and a_2 were placed in position #1 simultaneously. We do not know how to answer such questions in a way that leads to meaningful performance guarantees.
In the online setting, the most closely related work is that of Streeter and Golovin [18]. Like us, they consider sequences of monotone submodular reward functions that arrive online, and develop an online algorithm that uses multi-armed bandit algorithms as subroutines. The key difference from our work is that, as in [16], they are concerned with selecting a set of K items rather than with the more general problem, addressed in this paper, of selecting an assignment of items to positions. Kakade et al.
[10] considered the general problem of using α-approximation algorithms to construct no α-regret online algorithms, and essentially proved it can be done for the class of linear optimization problems in which the cost function has the form c(S, w) for a solution S and weight vector w, and c(S, w) is linear in w. However, their result is orthogonal to ours, because our objective function is submodular and not linear.²

7 Conclusions
In this paper, we showed that important problems, such as ad display in sponsored search and computing diverse rankings of information sources on the web, require optimizing assignments under submodular utility functions. We developed an efficient algorithm, TABULARGREEDY, which obtains the optimal approximation ratio of (1 − 1/e) for this NP-hard optimization problem. We also developed an online algorithm, TGBANDIT, which asymptotically achieves no (1 − 1/e)-regret for the problem of repeatedly selecting informative assignments, in both the full-information and bandit-feedback settings. Finally, we demonstrated that our algorithms outperform previous work on two real-world problems, namely online ranking of informative blogs and ad allocation.

Acknowledgments. This work was supported in part by Microsoft Corporation through a gift as well as through the Center for Computational Thinking at Carnegie Mellon, by NSF ITR grant CCR-0122581 (The Aladdin Center), and by ONR grant N00014-09-1-1044.

² One may linearize a submodular function by using a separate dimension for every possible function argument, but this leads to exponentially worse convergence time and regret bounds for the algorithms in [10] relative to TGBANDIT.

References
[1] Gagan Aggarwal, Jon Feldman, S. Muthukrishnan, and Martin Pál. Sponsored search auctions with Markovian users. In WINE, pages 621–628, 2008.
[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire.
The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[3] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
[4] Gruia Calinescu, Chandra Chekuri, Martin Pál, and Jan Vondrák. Maximizing a submodular set function subject to a matroid constraint. SIAM Journal on Computing. To appear.
[5] Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997.
[6] Benjamin Edelman, Michael Ostrovsky, and Michael Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. American Economic Review, 97(1):242–259, 2007.
[7] Jon Feldman and S. Muthukrishnan. Algorithmic methods for sponsored search advertising. In Zhen Liu and Cathy H. Xia, editors, Performance Modeling and Engineering. 2008.
[8] Marshall L. Fisher, George L. Nemhauser, and Laurence A. Wolsey. An analysis of approximations for maximizing submodular set functions - II. Mathematical Programming Study, 8:73–87, 1978.
[9] Daniel Golovin, Andreas Krause, and Matthew Streeter. Online learning of assignments that maximize submodular functions. CoRR, abs/0908.0772, 2009.
[10] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, pages 546–555, 2007.
[11] Andreas Krause and Carlos Guestrin. Beyond convexity: Submodularity in machine learning. Tutorial at ICML 2008. http://www.select.cs.cmu.edu/tutorials/icml08submodularity.html.
[12] Sébastien Lahaie, David M. Pennock, Amin Saberi, and Rakesh V. Vohra. Sponsored search auctions. In Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V.
Vazirani, editors, Algorithmic Game Theory. Cambridge University Press, New York, NY, USA, 2007.
[13] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In KDD, pages 420–429, 2007.
[14] Vahab Mirrokni, Michael Schapira, and Jan Vondrák. Tight information-theoretic lower bounds for welfare maximization in combinatorial auctions. In EC, pages 70–77, 2008.
[15] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265–294, 1978.
[16] Filip Radlinski, Robert Kleinberg, and Thorsten Joachims. Learning diverse rankings with multi-armed bandits. In ICML, pages 784–791, 2008.
[17] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. Technical Report CMU-CS-07-171, Carnegie Mellon University, 2007.
[18] Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In NIPS, pages 1577–1584, 2008.
[19] Jan Vondrák. Submodularity in Combinatorial Optimization. PhD thesis, Charles University, Prague, Czech Republic, 2007.
[20] Jan Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In STOC, pages 67–74, 2008.