{"title": "Adapting to the Shifting Intent of Search Queries", "book": "Advances in Neural Information Processing Systems", "page_first": 1829, "page_last": 1837, "abstract": "Search engines today present results that are often oblivious to recent shifts in intent. For example, the meaning of the query 'independence day' shifts in early July to a US holiday and to a movie around the time of the box office release. While no studies exactly quantify the magnitude of intent-shifting traffic, studies suggest that news events, seasonal topics, pop culture, and the like account for half of all search queries. This paper shows that the signals a search engine receives can be used both to determine that a shift in intent has happened and to find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases.", "full_text": "Adapting to the Shifting Intent of Search Queries∗

Umar Syed†
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
usyed@cis.upenn.edu

Aleksandrs Slivkins
Microsoft Research
Mountain View, CA 94043
slivkins@microsoft.com

Nina Mishra
Microsoft Research
Mountain View, CA 94043
ninam@microsoft.com

Abstract

Search engines today present results that are often oblivious to recent shifts in intent. 
For example, the meaning of the query 'independence day' shifts in early July to a US holiday and to a movie around the time of the box office release. While no studies exactly quantify the magnitude of intent-shifting traffic, studies suggest that news events, seasonal topics, pop culture, and the like account for half of all search queries. This paper shows that the signals a search engine receives can be used both to determine that a shift in intent has happened and to find a result that is now more relevant. We present a meta-algorithm that marries a classifier with a bandit algorithm to achieve regret that depends logarithmically on the number of query impressions, under certain assumptions. We provide strong evidence that this regret is close to the best achievable. Finally, via a series of experiments, we demonstrate that our algorithm outperforms prior approaches, particularly as the amount of intent-shifting traffic increases.

1 Introduction

Search engines typically use a ranking function to order results. The function scores a document by the extent to which it matches the query, and documents are ordered according to this score. This function is fixed in the sense that it does not change from one query to another and also does not change over time. For queries such as 'michael jackson', traditional ranking functions that value features such as high PageRank will not work, since documents new to the web will not have accrued sufficient inlinks. Thus, a search engine's ranking function should not be fixed; different results should surface depending on the temporal context.
Intuitively, a query is "intent-shifting" if the most desired search result(s) change over time. More concretely, a query's intent has shifted if the click distribution over search results at some time differs from the click distribution at a later time. 
For the query 'tomato', on the heels of a tomato salmonella outbreak, the probability that a user clicks on a news story describing the outbreak increases, while the probability that a user clicks on the Wikipedia entry for tomatoes rapidly decreases. Studies suggest that queries likely to be intent-shifting (such as pop culture, news event, trend, and seasonal topic queries) constitute roughly half of the search queries that a search engine receives [10].
The goal of this paper is to devise an algorithm that quickly adapts search results to shifts in user intent. Ideally, for every query and every point in time, we would like to display the search result that users are most likely to click. Since traditional ranking features like PageRank [4] change slowly over time, and may be misleading if user intent has shifted very recently, we want to use just the observed click behavior of users to decide which search results to display.

∗The full version of this paper [20] is available on arxiv.org. In the present version, all proofs are omitted.
†This work was done while the author was an intern at Microsoft Research and a student in the Department of Computer Science, Princeton University.

There are many signals a search engine can use to detect when the intent of a query shifts. Query features such as volume, abandonment rate, reformulation rate, occurrence in news articles, and the age of matching documents can all be used to build a classifier which, given a query, determines whether the intent has shifted. We refer to these features as the context, and an occasion when a shift in intent occurs as an event.
One major challenge in building an event classifier is obtaining training data. For most query and date combinations (e.g. 
'tomato, 06/09/2008'), it will be difficult even for a human labeler to recall in hindsight whether an event related to the query occurred on that date. In this paper, we propose a novel solution that learns from unlabeled contexts and user click activity.
Contributions. We describe a new algorithm that leverages the information contained in contexts. Our algorithm is in fact a meta-algorithm that combines a bandit algorithm designed for the event-free setting with an online classification algorithm. The classifier uses the contexts to predict when events occur, and the bandit algorithm "starts over" on positive predictions. The bandit algorithm provides feedback to the classifier by checking, soon after each of the classifier's positive predictions, whether the optimal search result actually changed. The key technical hurdle in proving a regret bound is handling events that happen during the "checking" phase.
For suitable choices of the bandit and classifier subroutines, the regret incurred by our meta-algorithm is (under certain mild assumptions) at most O((k + dF)(n/Δ) log T), where k is the number of events, dF is a certain measure of the complexity of the concept class F used by the classifier, n is the number of possible search results, Δ is the "minimum suboptimality" of any search result (defined formally in Section 2), and T is the total number of impressions. This regret bound has a very weak dependence on T, which is highly desirable for search engines that receive a large volume of traffic.
The context turns out to be crucial for achieving logarithmic dependence on T. Indeed, we show that any bandit algorithm that ignores context suffers regret Ω(√T), even when there is only one event. Unlike many lower bounds for bandit problems, our lower bound holds even when Δ is a constant independent of T. 
We also show that, assuming a logarithmic dependence on T, the dependence on k and dF is essentially optimal.
For empirical evaluation, we would ideally need access to the traffic of a real search engine so that search results could be adapted based on real-time click activity. Since we did not have access to live traffic, we instead conduct a series of synthetic experiments. The experiments show that if there are no events then the well-studied UCB1 algorithm [2] performs the best. However, when many different queries experience events, our algorithm significantly outperforms prior techniques.

2 Problem Formulation and Preliminaries

We view the problem of deciding which search results to display in response to user click behavior as a bandit problem, a well-known type of sequential decision problem. For a given query q, the task is to determine, at each round t ∈ {1, . . . , T} that q is issued by a user to our search engine, a single result it ∈ {1, . . . , n} to display.1 This result is clicked by the user with probability pt(it). A bandit algorithm A chooses it using only observed information from previous rounds, i.e., all previously displayed results and received clicks. The performance of an algorithm A is measured by its regret: R(A) := E[ Σ_{t=1}^T pt(i*_t) − pt(it) ], where an optimal result i*_t = arg max_i pt(i) is one with maximum click probability, and the expectation is taken over the randomness in the clicks and the internal randomization of the algorithm. Note our unusually strong definition of regret: we are competing against the best result on every round.
We call an event any round t where p_{t−1} ≠ p_t. It is reasonable to assume that the number of events k ≪ T, since we believe that abrupt shifts in user intent are relatively rare. Most existing bandit algorithms make no attempt to predict when events will occur, and consequently suffer regret Ω(√T). 
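To make this failure mode concrete, here is a minimal simulation sketch of the setting and the per-round regret defined above. Everything in it is an illustrative toy, not the paper's implementation: the click probabilities, the single event at round 500, and the naive follow-the-leader policy are all invented for illustration.

```python
import random

def simulate(policy, click_probs, T, seed=0):
    """Run `policy` for T rounds against time-varying click probabilities.

    click_probs(t) returns [p_t(0), ..., p_t(n-1)]; any round where this
    vector differs from the previous round's is an "event". Returns the
    cumulative regret against the per-round best result.
    """
    rng = random.Random(seed)
    regret = 0.0
    for t in range(T):
        p = click_probs(t)
        i = policy.select(t)            # result to display this round
        clicked = rng.random() < p[i]   # user clicks with probability p_t(i)
        policy.update(i, clicked)
        regret += max(p) - p[i]         # compete with the best result every round
    return regret

class FollowTheLeader:
    """Naive policy: always play the result with the best empirical click rate."""
    def __init__(self, n):
        self.clicks = [0.0] * n
        self.plays = [1e-9] * n         # avoid division by zero
    def select(self, t):
        rates = [c / m for c, m in zip(self.clicks, self.plays)]
        return max(range(len(rates)), key=rates.__getitem__)
    def update(self, i, clicked):
        self.plays[i] += 1
        self.clicks[i] += clicked

# Two results; an event at t = 500 swaps which result users prefer.
probs = lambda t: [0.7, 0.2] if t < 500 else [0.2, 0.7]
r = simulate(FollowTheLeader(2), probs, T=1000)
```

Because this naive policy never revisits its beliefs after the event, it keeps paying 0.5 regret on every remaining round; detecting the shift is exactly what an eventful bandit algorithm must add.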
On the other hand, a typical search engine receives many signals that can be used to predict events, such as bursts in query reformulation, the average age of retrieved documents, etc.

1For simplicity, we focus on the task of returning a single result, and not a list of results. Techniques from [19] may be adopted to find a good list of results.

We assume that our bandit algorithm receives a context xt ∈ X at each round t, and that there exists a function f ∈ F, in some known concept class F, such that f(xt) = +1 if an event occurs at round t, and f(xt) = −1 otherwise.2 In other words, f is an event oracle. At each round t, an eventful bandit algorithm must choose a result it using only observed information from previous rounds, i.e., all previously displayed results and received clicks, plus all contexts up to round t.
In order to develop an efficient eventful bandit algorithm, we make an additional key assumption: at least one optimal result before an event is significantly suboptimal after the event. More precisely, we assume there exists a minimum shift εS > 0 such that, whenever an event occurs at round t, we have pt(i*_{t−1}) < pt(i*_t) − εS for at least one previously optimal search result i*_{t−1}. For our problem setting, this assumption is relatively mild: the events we are interested in tend to have a rather dramatic effect on the optimal search results. Moreover, our bounds are parameterized by Δ = min_t min_{i ≠ i*_t} pt(i*_t) − pt(i), the minimum suboptimality of any suboptimal result.

3 Related Work

While there has been a substantial amount of work on ranking algorithms [11, 5, 13, 8, 6], all of these results assume that there is a fixed ranking function to learn, not one that shifts over time. 
Online bandit algorithms (see [7] for background) have been considered in the context of ranking. For instance, Radlinski et al. [19] showed how to compose several instantiations of a bandit algorithm to produce a ranked list of search results. Pandey et al. [18] showed that bandit algorithms can be effective in serving advertisements to search engine users. These approaches also assume a stationary inference problem.
Even though existing bandit work does not address our problem, there are two key algorithms that we do use in our work. The UCB1 algorithm [2] assumes fixed click probabilities and has regret at most O((n/Δ) log T). The EXP3.S algorithm [3] assumes that click probabilities can change on every round and has regret at most O(k √(nT log(nT))) for arbitrary pt's. Note that the dependence of EXP3.S on T is substantially stronger.
The "contextual bandits" problem setting [21, 17, 12, 16, 14] is similar to ours. A key difference is that the context received in each round is assumed to contain information about the identity of an optimal result i*_t, a considerably stronger assumption than we make. Our context includes only side information, such as the volume of the query, but we never actually receive information about the identity of the optimal result.
A different approach is to build a statistical model of user click behavior. This approach has been applied to the problem of serving news articles on the web. Diaz [9] used a regularized logistic model to determine when to surface news results for a query. Agarwal et al. [1] used several models, including a dynamic linear growth curve model.
There has also been work on detecting bursts in data streams. For example, Kleinberg [15] describes a state-based model for inferring stages of burstiness. 
The goal of our work is not to detect bursts, but rather to predict shifts in intent.
In recent concurrent and independent work, Yu et al. [22] studied bandit problems with "piecewise-stationary" distributions, a notion that closely resembles our definition of events. However, they make different assumptions than we do about the information a bandit algorithm can observe. Expressed in the language of our problem setting, they assume that from time to time a bandit algorithm receives information about how users would have responded to search results that are never actually displayed. For us, this assumption is clearly inappropriate.

2In some of our analysis, we require that contexts be restricted to a strict subset of X; the value of f outside this subset will technically be null.

4 Bandit with Classifier

Our algorithm is called BWC, or "Bandit with Classifier". The high-level idea is to use a bandit algorithm such as UCB1, restart it every time the classifier predicts an event, and use subsequent rounds to generate feedback for the classifier. We will present our algorithm in a modular way, as a meta-algorithm which uses the following two components: classifier and bandit. In each round, classifier inputs a context xt and outputs a "positive" or "negative" prediction of whether an event has happened in this round. Also, it may input labeled samples of the form (x, l), where x is a context and l is a boolean label, which it uses for training. 
Algorithm bandit is a bandit algorithm that is tuned for the event-free runs and provides the following additional functionality: after each round t of execution, it outputs the t-th round guess: a pair (G+, G−), where G+ and G− are subsets of arms that it estimates to be optimal and suboptimal, respectively.3 Since both classifier and bandit make predictions (about events and arms, respectively), for clarity we use the term "guess" exclusively to refer to predictions made by bandit, and reserve the term "prediction" for classifier.
The algorithm operates as follows. It runs in phases of two alternating types: odd phases are called "testing" phases, and even phases are called "adapting" phases. The first round of phase j is denoted tj. In each phase we run a fresh instance of bandit. Each testing phase lasts for L rounds, where L is a parameter. Each adapting phase j ends as soon as classifier predicts "positive"; the round t when this happens is round tj+1. Phase j is called full if it lasts at least L rounds. For a full phase j, let (G+_j, G−_j) be the L-th round guess in this phase. After each testing phase j, we generate a boolean prediction l of whether there was an event in the first round thereof. Specifically, letting i be the most recent full phase before j, we set ltj = false if and only if G+_i ∩ G−_j = ∅, i.e., if no previously optimal arm now appears suboptimal. If ltj is false, the labeled sample (xtj, ltj) is fed back to the classifier. Note that classifier never receives true-labeled samples. Pseudocode for BWC is given in Algorithm 1.
Disregarding the interleaved testing phases for the moment, BWC restarts bandit whenever classifier predicts "positive", optimistically assuming that the prediction is correct. 
By our assumption that events cause some optimal arm to become significantly suboptimal (see Section 2), an incorrect prediction should result in G+_i ∩ G−_j = ∅, where i is a phase before the putative event, and j is a phase after it. However, to ensure that the estimates G_i and G_j are reliable, we require that phases i and j are full. And to ensure that the full phases closest to a putative event are not too far from it, we insert a full testing phase every other phase.

Algorithm 1 BWC Algorithm
1: Given: Parameter L, an (L, εS)-testable bandit, and a safe classifier.
2: for phase j = 1, 2, . . . do
3:   Initialize bandit. Let tj be the current round.
4:   if j is odd then
5:     for round t = tj, . . . , tj + L do
6:       Select arm it according to bandit.
7:       Observe pt(it) and update bandit.
8:     Let i be the most recent full phase before j.
9:     If G+_i ∩ G−_j = ∅, let ltj = false and pass training example (xtj, ltj) to classifier.
10:  else
11:    for round t = tj, tj + 1, . . . do
12:      Select arm it according to bandit.
13:      Observe pt(it) and update bandit; pass context xt to classifier.
14:      if classifier predicts "positive" then
15:        Terminate inner for loop.

Let S be the set of all contexts which correspond to an event. When the classifier receives a context x and predicts a "positive", this prediction is called a true positive if x ∈ S, and a false positive otherwise. Likewise, when the classifier predicts a "negative", the prediction is called a true negative if x ∉ S, and a false negative otherwise. The sample (x, l) is correctly labeled if l = (x ∈ S).
We make the following two assumptions. First, classifier is safe for a given concept class: if it inputs only correctly labeled samples, it never outputs a false negative. 
Second, bandit is (L, ε)-testable, in the following sense. Consider an event-free run of bandit, and let (G+, G−) be its L-th round guess. Then with probability at least 1 − T^{-2}, each optimal arm lies in G+ but not in G−, and any arm that is at least ε-suboptimal lies in G− but not in G+. So an (L, ε)-testable bandit algorithm is one that, after L rounds, has a good guess of which arms are optimal and which are at least ε-suboptimal.
For correctness, we require bandit to be (L, εS)-testable, where εS is the minimum shift. The performance of bandit is quantified via its event-free regret, i.e., regret on the event-free runs. Likewise, for correctness we need classifier to be safe; we quantify its performance via the maximum possible number of false positives, in the precise sense defined below. We assume that the state of classifier is updated only if it receives a labeled sample, and consider a game in which in each round t, classifier receives a context xt ∉ S, outputs a (false) positive, and receives a (correctly) labeled sample (xt, false). For a given context set X and a given concept class F, let the FP-complexity of the classifier be the maximal possible number of rounds in such a game, where the maximum is taken over all event oracles f ∈ F and all possible sequences {xt}. Put simply, the FP-complexity of classifier is the maximum number of consecutive false positives it can make when given correctly labeled examples.
We will discuss efficient implementations of a safe classifier and an (L, ε)-testable bandit in Sections 5 and 6, respectively. We present provable guarantees for BWC in a modular way, in terms of FP-complexity, event-free regret, and the number of events.

3Following established convention, we call the options available to a bandit algorithm "arms". In our setting, each arm corresponds to a search result.
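The interaction of these components can be sketched end-to-end as follows. This is an illustrative toy, not the paper's implementation: ToyBandit and ToyClassifier are invented stand-ins, not the safe classifier of Section 5 or the (L, ε)-testable bandit of Section 6; only the alternation of testing and adapting phases, the per-phase restarts, and the feedback rule follow the text above.

```python
class ToyBandit:
    """Illustrative stand-in bandit: round-robin exploration plus a guess
    (G+, G-) of which arms look optimal / suboptimal by a fixed margin."""
    def __init__(self, n, margin=0.2):
        self.n, self.margin = n, margin
        self.clicks = [0.0] * n
        self.plays = [0] * n

    def select(self, t):
        return t % self.n

    def update(self, i, clicked):
        self.plays[i] += 1
        self.clicks[i] += float(clicked)

    def guess(self):
        rates = [c / max(1, m) for c, m in zip(self.clicks, self.plays)]
        best = max(rates)
        g_plus = {i for i, r in enumerate(rates) if r > best - self.margin / 2}
        g_minus = {i for i, r in enumerate(rates) if r <= best - self.margin}
        return g_plus, g_minus


class ToyClassifier:
    """Illustrative stand-in classifier: predicts 'positive' unless the
    context is dominated coordinate-wise by one already labeled negative."""
    def __init__(self):
        self.negatives = []

    def predict_positive(self, x):
        return not any(all(a <= b for a, b in zip(x, neg))
                       for neg in self.negatives)

    def feed(self, x, label):
        if label is False:
            self.negatives.append(x)


def bwc(T, L, n, contexts, pull, classifier):
    """Alternate testing phases (exactly L rounds) with adapting phases
    that end when the classifier predicts an event; restart the bandit at
    every phase boundary, and feed false labels back on refuted alarms."""
    t, j, prev_guess, chosen = 0, 1, None, []
    while t < T:
        bandit = ToyBandit(n)
        start = t
        if j % 2 == 1:                      # testing phase
            while t < T and t - start < L:
                i = bandit.select(t)
                bandit.update(i, pull(t, i))
                chosen.append(i)
                t += 1
            g_plus, g_minus = bandit.guess()
            if prev_guess is not None and not (prev_guess[0] & g_minus):
                # No previously optimal arm became suboptimal: the positive
                # prediction at round `start` was a false alarm.
                classifier.feed(contexts[start], False)
            prev_guess = (g_plus, g_minus)
        else:                               # adapting phase
            while t < T:
                i = bandit.select(t)
                bandit.update(i, pull(t, i))
                chosen.append(i)
                fired = classifier.predict_positive(contexts[t])
                t += 1
                if fired:
                    break
            if t - start >= L:              # only full phases yield guesses
                prev_guess = bandit.guess()
        j += 1
    return chosen

# A tiny event-free run: arm 0 is always clicked, arm 1 never.
clf = ToyClassifier()
contexts = [(0.5, 0.5)] * 100
chosen = bwc(T=100, L=10, n=2, contexts=contexts,
             pull=lambda t, i: i == 0, classifier=clf)
```

On this event-free run, the over-eager classifier fires once, the testing phase refutes the alarm, and the fed-back negative label stops further spurious restarts.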
The main technical difficulty in the analysis is that the correct operation of the components of BWC, classifier and bandit, is interdependent. In particular, one challenge is to handle events that occur during the first L rounds of a phase; these events may potentially "contaminate" the L-th round guesses and cause incorrect feedback to classifier.
Theorem 1. Consider an instance of the eventful bandit problem with number of rounds T, n arms, k events and minimum shift εS. Consider algorithm BWC with parameter L and components classifier and bandit such that for this problem instance, classifier is safe, and bandit is (L, εS)-testable. If any two events are at least 2L rounds apart, then the regret of BWC is

R(T) ≤ (2k + d) R0(T) + (k + d) R0(L) + kL,   (1)

where d is the FP-complexity of the classifier and R0(·) is the event-free regret of bandit.
Remarks. The proof is available in the full version [20]. In our implementations of bandit, L = Θ((n/εS²) log T) suffices. In the +kL term in (1), the k can be replaced by the number of testing phases that contain both a false positive in round 1 of the phase and an actual event later in the phase; this number can potentially be much smaller than k.

5 Safe Classifier

We seek a classifier that is safe for a given concept class F and has low FP-complexity. We present a classifier whose FP-complexity is bounded in terms of the following property of F:
Definition 1. Define the safe function SF : 2^X → 2^X of F as follows: x ∈ SF(N) if and only if there is no concept f ∈ F such that f(y) = −1 for all y ∈ N and f(x) = +1. The diameter of F, denoted dF, is equal to the length of the longest sequence x1, . . . , xm ∈ X such that xt ∉ SF({x1, . . . , x_{t−1}}) for all t = 1, . . . 
, m.
So if N contains only true negatives, then SF(N) contains only true negatives. This property suggests that SF can be used to construct a safe classifier SafeCl, which operates as follows: it maintains a set of false-labeled examples N, initially empty. When input an unlabeled context x, SafeCl outputs a positive prediction if and only if x ∉ SF(N). After making a positive prediction, SafeCl inputs a labeled example (x, l). If l = false, then x is added to N; otherwise x is discarded. Clearly, SafeCl is a safe classifier.
In the full version [20], we show that the FP-complexity of SafeCl is at most the diameter dF, which is to be expected: FP-complexity is a property of a classifier, and diameter is the completely analogous property for SF. Moreover, we give examples of common concept classes with efficiently computable safe functions. For example, if F is the space of hyperplanes with "margin" at least δ (probably the most commonly used concept class in machine learning), then SF(N) is the convex hull of the examples in N, extended in all directions by δ.
By using SafeCl as our classifier, we introduce dF into the regret bound of BWC, and this quantity can be large. However, in Section 7 we show that the regret of any algorithm must depend on dF, unless it depends strongly on the number of rounds T.

6 Testable Bandit Algorithms

In this section we will consider the stochastic n-armed bandit problem. We are looking for (L, ε)-testable algorithms with low regret. The L will need to be sufficiently large, on the order of Ω(nε^{-2}). A natural candidate would be algorithm UCB1 from [2], which does very well on regret. Unfortunately, it does not come with a guarantee of (L, ε)-testability. One simple fix is to choose at random between arms in the first L rounds, use these samples to form the best guess in a straightforward way, and then run UCB1. However, in the first L rounds this algorithm incurs regret of Ω(L), which is very suboptimal. For instance, for UCB1 the regret would be R(L) ≤ O(min((n/Δ) log L, √(nL log L))).
In this section, we develop an algorithm which has the same regret bound as UCB1, and is (L, ε)-testable. We state this result more generally, in terms of estimating expected payoffs; we believe it may be of independent interest. The (L, ε)-testability is then an easy corollary.
Since our analysis in this section is for the event-free setting, we can drop the subscript t from much of our notation. Let p(u) denote the (time-invariant) expected payoff of arm u. Let p* = max_u p(u), and let Δ(u) = p* − p(u) be the "suboptimality" of arm u. For round t, let µt(u) be the sample average of arm u, and let nt(u) be the number of times arm u has been played.
We will use a slightly modified algorithm UCB1 from [2], with a significantly extended analysis. Recall that in each round t algorithm UCB1 chooses an arm u with the highest index It(u) = µt(u) + rt(u), where rt(u) = √(8 log(t)/nt(u)) is a term that we'll call the confidence radius, whose meaning is that |p(u) − µt(u)| ≤ rt(u) with high probability. For our purposes here it is instructive to re-write the index as It(u) = µt(u) + α rt(u) for some parameter α. Also, to better bound the early failure probability, we will re-define the confidence radius as rt(u) = √(8 log(t0 + t)/nt(u)) for some parameter t0. We will denote this parameterized version by UCB1(α, t0). Essentially, the original analysis of UCB1 in [2] carries over; we omit the details.
Our contribution concerns estimating the Δ(u)'s. 
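Before describing those estimates, the parameterized index just defined can be sketched as follows. This is an illustrative sketch; the play-each-arm-once initialization is a standard convention, not something spelled out above.

```python
import math

def ucb1_index(mu, n_pulls, t, alpha=6, t0=0):
    """Index I_t(u) = mu_t(u) + alpha * r_t(u), with confidence radius
    r_t(u) = sqrt(8 * log(t0 + t) / n_t(u)), as in UCB1(alpha, t0)."""
    radius = math.sqrt(8.0 * math.log(t0 + t) / n_pulls)
    return mu + alpha * radius

def select_arm(mus, pulls, t, alpha=6, t0=0):
    """Play every arm once, then play the arm with the highest index."""
    for u, n in enumerate(pulls):
        if n == 0:
            return u
    scores = [ucb1_index(mu, n, t, alpha, t0) for mu, n in zip(mus, pulls)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The default alpha = 6 mirrors the choice UCB1(6, t0) appearing in Theorem 2 below; larger t0 inflates the radius early on, which is exactly what bounds the early failure probability.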
We estimate the maximal expected reward p* via the sample average of an arm that has been played most often. More precisely, in order to bound the failure probability, we consider an arm that has been played most often in the last t/2 rounds. For a given round t, let vt be one such arm (ties broken arbitrarily), and let Δt(u) = µt(vt) − µt(u) be our estimate of Δ(u). We express the "quality" of this estimate as follows:
Theorem 2. Consider the stochastic n-armed bandit problem. Suppose algorithm UCB1(6, t0) has been played for t steps, and t + t0 ≥ 32. Then with probability at least 1 − (t0 + t)^{-2}, for any arm u we have

|Δ(u) − Δt(u)| < (1/4) Δ(u) + δ(t),   (2)

where δ(t) = O(√((n/t) log(t + t0))).
Remark. Either we know that Δ(u) is small, or we can approximate it up to a constant factor. Specifically, if δ(t) < (1/2) Δt(u) then Δ(u) ≤ 2 Δt(u) ≤ 5 Δ(u); else Δ(u) ≤ 4 δ(t).
Let us convert UCB1(6, T) into an (L, ε)-testable algorithm, as long as L ≥ Ω((n/ε²) log T). The t-th round best guess (G+_t, G−_t) is defined as G+_t = {u : Δt(u) ≤ ε/4} and G−_t = {u : Δt(u) > ε/2}. Then the resulting algorithm is (L, ε)-testable assuming that δ(L) ≤ ε/4, where δ(t) is from Theorem 2. The proof is in the full version [20].

7 Upper and Lower Bounds

Plugging the classifier from Section 5 and the bandit algorithm from Section 6 into the meta-algorithm from Section 4, we obtain the following numerical guarantee.
Theorem 3. Consider an instance S of the eventful bandit problem with number of rounds T, n arms and k events, minimum shift εS, minimum suboptimality Δ, and concept class diameter dF. 
Assume that any two events are at least 2L rounds apart, where L = Θ((n/εS²) log T). Consider the BWC algorithm with parameter L and components classifier and bandit as presented, respectively, in Section 5 and Section 6. Then the regret of BWC is R(T) ≤ ((3k + 2dF)(n/Δ) + k(n/εS²)) log T.
While the linear dependence on n in this bound may seem large, note that without additional assumptions, regret must be linear in n, since each arm must be pulled at least once. In an actual search engine application, the arms can be restricted to, say, the top ten results that match the query.
We now state two lower bounds for eventful bandit problems; the proofs are in the full version [20]. Theorem 4 shows that in order to achieve regret that is logarithmic in the number of rounds, a context-aware algorithm is necessary, assuming there is at least one event. Incidentally, this lower bound can be easily extended to prove that, in our model, no algorithm can achieve logarithmic regret when an event oracle f is not contained in the concept class F.
Theorem 4. Consider the eventful bandit problem with number of rounds T, two arms, minimum shift εS and minimum suboptimality Δ, where εS = Δ = ε, for an arbitrary ε ∈ (0, 1/2). For any context-ignoring bandit algorithm A, there exists a problem instance with a single event such that regret RA(T) ≥ Ω(ε√T).
Theorem 5 proves that the linear dependence on k + dF in Theorem 3 is essentially unavoidable: if we desire a regret bound with logarithmic dependence on the number of rounds, then a linear dependence on k + dF is necessary.
Theorem 5. Consider the eventful bandit problem with number of rounds T and concept class diameter dF. Let A be an eventful bandit algorithm. 
Then there exists a problem instance with n arms, k events, minimum shift εS, and minimum suboptimality Δ, where εS = Δ = ε, for any given values of k ≥ 1, n ≥ 3, and ε ∈ (0, 1/4), such that RA(T) ≥ Ω(k(n/ε)) log(T/k). Moreover, there exists a problem instance with two arms, a single event, event threshold Θ(1) and minimum suboptimality Θ(1) such that regret RA(T) ≥ Ω(max(T^{1/3}, dF)) log T.

8 Experiments

Truly demonstrating the benefits of BWC requires real-time manipulation of search results. Since we did not have the means to deploy a system that monitors click/skip activity and correspondingly alters search results with live users, we describe a collection of experiments on synthetically generated data.
We begin with a head-to-head comparison of BWC versus a baseline UCB1 algorithm and show that BWC's performance improves substantially upon UCB1. Next, we compare the performance of these algorithms as we vary the fraction of intent-shifting queries: as the fraction increases, BWC's performance improves even further upon prior approaches. Finally, we compare the performance as we vary the number of features. While our theoretical results suggest that regret grows with the number of features in the context space, in our experiments we surprisingly find that BWC is robust to higher-dimensional feature spaces.
Setup: We synthetically generate data as follows. We assume that there are 100 queries, where the total number of times these queries are posed is 3M. Each query has five search results for a user to select from. If a query does not experience any events (i.e., it is not "intent-shifting") then the optimal search result is fixed over time; otherwise the optimal search result may change. Only 10% of the queries are intent-shifting, with at most 10 events per such query. 
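A data generator in the spirit of this setup might look as follows. It is an invented sketch, not the paper's code: the feature ranges, the click probabilities, the rectangle [0, 0.8]² standing in for the axis-parallel-rectangle event oracle used in these experiments, and the scaled-down default sizes are all illustrative assumptions.

```python
import random

def make_query_stream(n_queries=100, n_impressions=3000, n_results=5,
                      shifting_frac=0.1, max_events=10, seed=0):
    """Generate a toy stream of (query, round, click_probs, context) tuples.

    A `shifting_frac` fraction of queries experiences up to `max_events`
    events, at which the identity of the optimal result changes. Contexts
    are 2-D feature vectors: event rounds fall outside a hypothetical
    oracle rectangle [0, 0.8] x [0, 0.8], non-event rounds inside it.
    All constants here are illustrative and scaled down from the paper's.
    """
    rng = random.Random(seed)
    stream = []
    n_shifting = int(shifting_frac * n_queries)
    for q in range(n_queries):
        n_events = rng.randint(1, max_events) if q < n_shifting else 0
        event_rounds = set(rng.sample(range(1, n_impressions), n_events))
        best = rng.randrange(n_results)
        for t in range(n_impressions):
            is_event = t in event_rounds
            if is_event:   # the optimal result shifts to a different one
                best = rng.choice([i for i in range(n_results) if i != best])
            probs = [0.8 if i == best else 0.2 for i in range(n_results)]
            ctx = ((rng.uniform(0.8, 1.0), rng.uniform(0.8, 1.0)) if is_event
                   else (rng.uniform(0.0, 0.8), rng.uniform(0.0, 0.8)))
            stream.append((q, t, probs, ctx))
    return stream

stream = make_query_stream(n_queries=10, n_impressions=50)
```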
Due to the random\nnature with which data is generated, regret is reported as an average over 10 runs. The event oracle\nis an axis-parallel rectangle anchored at the origin, where points inside the box are negative and\npoints outside the box are positive. Thus, if there are two features, say query volume and query\nabandonment rate, an event occurs if and only if both the volume and abandonment rate exceed\ncertain thresholds.\nBandit with Classi\ufb01er (BWC): Figure 1(a) shows the average cumulative regret over time of three\nalgorithms. Our baseline comparison is UCB1 which assumes that the best search result is \ufb01xed\nthroughout.\nIn addition, we compare to an algorithm we call ORA, which uses the event oracle\nto reset UCB1 whenever an event occurs. We also compared to EXP3.S, but its performance was\ndramatically worse and thus we have not included it in the \ufb01gure.\nIn the early stages of the experiment before any intent-shifting event has happened, UCB1 performs\nthe best. BWC\u2019s safe classi\ufb01er makes many mistakes in the beginning and consequently pays the\nprice of believing that each query is experiencing an event when in fact it is not. 
As time progresses, BWC's classifier makes fewer mistakes, and consequently knows when to reset UCB1 more accurately. UCB1 alone ignores the context entirely and thus incurs substantially larger cumulative regret by the end.

Figure 1: (a) BWC's cumulative regret over time (impressions) compared to UCB1 and ORA (UCB1 with an oracle indicating the exact locations of the intent-shifting events). (b) Final regret (in thousands) as the fraction of intent-shifting queries varies; with more intent-shifting queries, BWC's advantage over prior approaches improves. (c) Final regret (in thousands) as the number of features grows.

Final regret (in thousands) vs. fraction of intent-shifting queries (Figure 1(b)):

          0      1/8    1/4    3/8    1/2
ORA       17.2   22.8   30.4   33.8   39.5
BWC       17.8   24.6   39.9   46.7   99.4
UCB1      17.2   34.1   114.9  84.2   140.0
EXP3.S    78.4   123.7  180.2  197.6  243.1

Final regret (in thousands) vs. number of features (Figure 1(c)):

          10     20     30     40
ORA       21.9   23.2   21.9   22.8
BWC       23.1   24.4   22.9   23.7
UCB1      32.3   33.5   31.1   37.4
EXP3.S    111.6  109.4  112.5  121.3

Fraction of Intent-Shifting Queries: In the next experiment, we varied the fraction of intent-shifting queries over 0, 1/8, 1/4, 3/8, and 1/2; Figure 1(b) shows the results. If there are no intent-shifting queries, then UCB1's regret is the best. We expect this outcome since BWC's classifier, because it is safe, initially assumes that all queries are intent-shifting and thus needs time to learn that in fact none are. On the other hand, BWC outperforms the other approaches, especially as the fraction of intent-shifting queries grows.
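ORA's reset rule, restarting UCB1 whenever the oracle reports an event, can be sketched as follows. The UCB1 index is the standard one of Auer, Cesa-Bianchi, and Fischer (2002); the class and function names are ours, and the deterministic payoffs in the usage line are purely illustrative.

```python
import math

class UCB1:
    """Standard UCB1: play the arm maximizing mean_i + sqrt(2 ln t / n_i)."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.reset()

    def reset(self):
        """Forget all statistics (used when an event is detected)."""
        self.counts = [0] * self.n_arms
        self.sums = [0.0] * self.n_arms
        self.t = 0

    def select(self):
        self.t += 1
        for a in range(self.n_arms):          # try every arm once first
            if self.counts[a] == 0:
                return a
        return max(range(self.n_arms),
                   key=lambda a: self.sums[a] / self.counts[a]
                       + math.sqrt(2 * math.log(self.t) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

def run_ora(bandit, events, payoffs):
    """ORA sketch: run UCB1, resetting its statistics whenever the event
    oracle flags an event at that impression."""
    total = 0.0
    for event, payoff in zip(events, payoffs):
        if event:
            bandit.reset()
        arm = bandit.select()
        bandit.update(arm, payoff[arm])
        total += payoff[arm]
    return total

# usage sketch: two results, result 0 always clicked; no events fire,
# so UCB1 concentrates on result 0 and collects most of the clicks
total = run_ora(UCB1(2), [False] * 100, [[1.0, 0.0]] * 100)
```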
EXP3.S\u2019s performance is quite poor in this experiment \u2013 even when all queries are\nintent-shifting. The reason is that even when a query is intent-shifting, there are at most 10 intent-\nshifting events, i.e., each query\u2019s intent is not shifting all the time.\nWith more intent-shifting queries, the expectation is that regret monotonically increases. In general,\nthis seems to be true in our experiment. There is however a decrease in regret going from 1/4 to 3/8\nintent-shifting queries. We believe that this is due to the fact that each query has at most 10 intent-\nshifting events spread uniformly and it is possible that there were fewer events with potentially\nsmaller shifts in intent in those runs. In other words, the standard deviation of the regret is large.\nOver the ten 3/8 intent-shifting runs for ORA, BWC, UCB1 and EXP3.S, the standard deviation was\nroughly 1K, 10K, 12K and 6K respectively.\nNumber of Features: Finally, we comment on the performance of our approach as the number of\nfeatures grows. Our theoretical results suggest that BWC\u2019s performance should deteriorate as the\nnumber of features grows. Surprisingly, BWC\u2019s performance is consistently close to the Oracle\u2019s.\nIn Figure 1(b), we show the cumulative regret after 3M impressions as the dimensionality of the\ncontext vector grows from 10 to 40 features. BWC\u2019s regret is consistently close to ORA as the\nnumber of features grows. On the other hand, UCB1\u2019s regret though competitive is worse than BWC,\nwhile EXP3.S\u2019s performance is across the board poor. Note that both UCB1 and EXP3.S\u2019s regret is\ncompletely independent of the number of features. The standard deviation of the regret over the 10\nruns is substantially lower than the previous experiment. 
For example, over 10 features, the standard deviation was 355, 1K, 5K, and 4K for ORA, BWC, UCB1, and EXP3.S, respectively.

9 Future Work

The main question left for future work is testing this approach in a real setting. Since gaining access to live traffic is difficult, it would be interesting to find ways to rewind the search logs to simulate live traffic.

Acknowledgements. We thank Rakesh Agrawal, Alan Halverson, Krishnaram Kenthapadi, Robert Kleinberg, Robert Schapire and Yogi Sharma for their helpful comments and suggestions.