{"title": "Adaptive Submodular Maximization in Bandit Setting", "book": "Advances in Neural Information Processing Systems", "page_first": 2697, "page_last": 2705, "abstract": "Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the scenario where the expected gain is initially unknown and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem.", "full_text": "Adaptive Submodular Maximization in Bandit Setting\n\nVictor Gabillon\n\nINRIA Lille - team SequeL\nVilleneuve d\u2019Ascq, France\nvictor.gabillon@inria.fr\n\nBranislav Kveton\nTechnicolor Labs\n\nPalo Alto, CA\n\nbranislav.kveton@technicolor.com\n\nZheng Wen\n\nElectrical Engineering Department\n\nStanford University\n\nzhengwen@stanford.edu\n\nBrian Eriksson\nTechnicolor Labs\n\nPalo Alto, CA\n\nbrian.eriksson@technicolor.com\n\nS. Muthukrishnan\n\nDepartment of Computer Science\n\nRutgers\n\nmuthu@cs.rutgers.edu\n\nAbstract\n\nMaximization of submodular functions has wide applications in machine learning\nand arti\ufb01cial intelligence. 
Adaptive submodular maximization has been tradition-\nally studied under the assumption that the model of the world, the expected gain\nof choosing an item given previously selected items and their states, is known. In\nthis paper, we study the setting where the expected gain is initially unknown, and\nit is learned by interacting repeatedly with the optimized function. We propose an\nef\ufb01cient algorithm for solving our problem and prove that its expected cumulative\nregret increases logarithmically with time. Our regret bound captures the inherent\nproperty of submodular maximization, earlier mistakes are more costly than later\nones. We refer to our approach as Optimistic Adaptive Submodular Maximization\n(OASM) because it trades off exploration and exploitation based on the optimism in\nthe face of uncertainty principle. We evaluate our method on a preference elicita-\ntion problem and show that non-trivial K-step policies can be learned from just a\nfew hundred interactions with the problem.\n\n1\n\nIntroduction\n\nMaximization of submodular functions [14] has wide applications in machine learning and arti\ufb01cial\nintelligence, such as social network analysis [9], sensor placement [10], and recommender systems\n[7, 2]. In this paper, we study the problem of adaptive submodular maximization [5]. This problem\nis a variant of submodular maximization where each item has a state and this state is revealed when\nthe item is chosen. The goal is to learn a policy that maximizes the expected return for choosing K\nitems.\nAdaptive submodular maximization has been traditionally studied in the setting where the model of\nthe world, the expected gain of choosing an item given previously selected items and their states, is\nknown. This is the \ufb01rst paper that studies the setting where the model is initially unknown, and it is\nlearned by interacting repeatedly with the environment. 
We bring together the concepts of adaptive\nsubmodular maximization and bandits, and the result is an efficient solution to our problem.\nWe make four major contributions. First, we propose a model where the expected gain of choosing\nan item can be learned efficiently. The main assumption in the model is that the state of each item is\ndistributed independently of the other states. Second, we propose Optimistic Adaptive Submodular\nMaximization (OASM), a bandit algorithm that selects items with the highest upper confidence bound\non the expected gain. This algorithm is computationally efficient and easy to implement. Third, we\nprove that the expected cumulative regret of our algorithm increases logarithmically with time. Our\nregret bound captures an inherent property of adaptive submodular maximization: earlier mistakes\nare more costly than later ones. Finally, we apply our approach to a real-world preference elicitation\nproblem and show that non-trivial policies can be learned from just a few hundred interactions with\nthe problem.\n\n2 Adaptive Submodularity\n\nIn adaptive submodular maximization, the objective is to maximize, under constraints, a function of\nthe form:\n\nf : 2^I \u00d7 {\u22121, 1}^L \u2192 R, (1)\n\nwhere I = {1, . . . , L} is a set of L items and 2^I is its power set. The first argument of f is a subset\nof chosen items A \u2286 I. The second argument is the state \u03c6 \u2208 {\u22121, 1}^L of all items. The i-th entry\nof \u03c6, \u03c6[i], is the state of item i. The state \u03c6 is drawn i.i.d. from some probability distribution P (\u03a6).\nThe reward for choosing items A in state \u03c6 is f (A, \u03c6). For simplicity of exposition, we assume that\nf (\u2205, \u03c6) = 0 in all \u03c6. In the problems of our interest, the state is only partially observed. To capture this\nphenomenon, we introduce the notion of observations. 
An observation is a vector y \u2208 {\u22121, 0, 1}^L\nwhose non-zero entries are the observed states of items. We say that y is an observation of state \u03c6,\nand write \u03c6 \u223c y, if y[i] = \u03c6[i] in all non-zero entries of y. Alternatively, the state \u03c6 can be viewed\nas a realization of y, one of many. We denote by dom (y) = {i : y[i] \u2260 0} the observed items in y\nand by \u03c6\u27e8A\u27e9 the observation of items A in state \u03c6. We define a partial ordering on observations and\nwrite y\u2032 \u2ab0 y if y\u2032[i] = y[i] in all non-zero entries of y; that is, y\u2032 is a more specific observation than y. In\nthe terminology of Golovin and Krause [5], y is a subrealization of y\u2032.\nWe illustrate our notation on a simple example. Let \u03c6 = (1, 1, \u22121) be a state, and y1 = (1, 0, 0) and\ny2 = (1, 0, \u22121) be observations. Then all of the following claims are true:\n\n\u03c6 \u223c y1, \u03c6 \u223c y2, y2 \u2ab0 y1, dom (y2) = {1, 3} , \u03c6\u27e8{1, 3}\u27e9 = y2, \u03c6\u27e8dom (y1)\u27e9 = y1.\n\nOur goal is to maximize the expected value of f by adaptively choosing K items. This problem can\nbe viewed as a K-step game, where at each step we choose an item according to some policy \u03c0 and\nthen observe its state. A policy \u03c0 : {\u22121, 0, 1}^L \u2192 I is a function from observations y to items. The\nobservations represent our past decisions and their outcomes. A k-step policy in state \u03c6, \u03c0k(\u03c6), is a\ncollection of the first k items chosen by policy \u03c0. The policy is defined recursively as:\n\n\u03c0k(\u03c6) = \u03c0k\u22121(\u03c6) \u222a {\u03c0[k](\u03c6)} , \u03c0[k](\u03c6) = \u03c0(\u03c6\u27e8\u03c0k\u22121(\u03c6)\u27e9), \u03c00(\u03c6) = \u2205, (2)\n\nwhere \u03c0[k](\u03c6) is the k-th item chosen by policy \u03c0 in state \u03c6. 
The optimal K-step policy satisfies:\n\n\u03c0\u2217 = arg max_\u03c0 E\u03c6[f (\u03c0K(\u03c6), \u03c6)] . (3)\n\nIn general, the problem of computing \u03c0\u2217 is NP-hard [14, 5]. However, near-optimal policies can be\ncomputed efficiently when the maximized function has a diminishing return property. Formally, we\nrequire that the function is adaptive submodular and adaptive monotonic [5].\nDefinition 1. Function f is adaptive submodular if:\n\nE\u03c6[ f (A \u222a {i} , \u03c6) \u2212 f (A, \u03c6)| \u03c6 \u223c yA ] \u2265 E\u03c6[ f (B \u222a {i} , \u03c6) \u2212 f (B, \u03c6)| \u03c6 \u223c yB ]\n\nfor all items i \u2208 I \\ B and observations yB \u2ab0 yA, where A = dom (yA) and B = dom (yB).\nDefinition 2. Function f is adaptive monotonic if E\u03c6[ f (A \u222a {i} , \u03c6) \u2212 f (A, \u03c6)| \u03c6 \u223c yA ] \u2265 0 for\nall items i \u2208 I \\ A and observations yA, where A = dom (yA).\nIn other words, the expected gain of choosing an item is always non-negative and does not increase\nas the observations become more specific.\nLet \u03c0g be the greedy policy for maximizing f, a policy that always selects the item with the highest\nexpected gain:\n\n\u03c0g(y) = arg max_{i \u2208 I \\ dom(y)} gi(y), (4)\n\nwhere:\n\ngi(y) = E\u03c6[ f (dom (y) \u222a {i} , \u03c6) \u2212 f (dom (y) , \u03c6)| \u03c6 \u223c y ] (5)\n\nis the expected gain of choosing item i after observing y. Then, based on the result of Golovin and\nKrause [5], \u03c0g is a (1 \u2212 1/e)-approximation to \u03c0\u2217, E\u03c6[f (\u03c0g_K(\u03c6), \u03c6)] \u2265 (1 \u2212 1/e) E\u03c6[f (\u03c0\u2217_K(\u03c6), \u03c6)],\nif f is adaptive submodular and adaptive monotonic. In the rest of this paper, we say that an observation\ny is a context if it can be observed under the greedy policy \u03c0g. 
Specifically, there exist k and\n\u03c6 such that y = \u03c6\u27e8\u03c0g_k(\u03c6)\u27e9.\n\n3 Adaptive Submodularity in Bandit Setting\n\nThe greedy policy \u03c0g can be computed only if the objective function f and the distribution of states\nP (\u03a6) are known, because both of these quantities are needed to compute the marginal benefit gi(y)\n(Equation 5). In practice, the distribution P (\u03a6) is often unknown, for instance in a newly deployed\nsensor network where the failure rates of the sensors are unknown. In this paper, we study a natural\nvariant of adaptive submodular maximization that can model such problems. The distribution P (\u03a6)\nis assumed to be unknown and we learn it by interacting repeatedly with the problem.\n\n3.1 Model\n\nThe problem of learning P (\u03a6) can be cast in many ways. One approach is to directly learn the joint\nP (\u03a6). This approach is not practical for two reasons. First, the number of states \u03c6 is exponential in\nthe number of items L. Second, the state of our problem is observed only partially. As a result, it is\ngenerally impossible to identify the distribution that generates \u03c6. Another possibility is to learn the\nprobability of individual states \u03c6[i] conditioned on context, observations y under the greedy policy\n\u03c0g in up to K steps. This is impractical because the number of contexts is exponential in K.\nClearly, additional structural assumptions are necessary to obtain a practical solution. In this paper,\nwe assume that the states of items are independent of the context in which the items are chosen. In\nparticular, the state \u03c6[i] of each item i is drawn i.i.d. from a Bernoulli distribution with mean pi. 
In\nthis setting, the joint probability distribution factors as:\n\nP (\u03a6 = \u03c6) = \u220f_{i=1}^{L} pi^(1{\u03c6[i]=1}) (1 \u2212 pi)^(1 \u2212 1{\u03c6[i]=1}) (6)\n\nand the problem of learning P (\u03a6) reduces to estimating L parameters, the means of the Bernoullis.\nA major question is how restrictive our independence assumption is. We argue that this assumption\nis fairly natural in many applications. For instance, consider a sensor network where the sensors fail\nat random due to manufacturing defects. The failures of these sensors are independent of each other\nand thus can be modeled in our framework. To validate our assumption, we conduct an experiment\n(Section 4) that shows that it does not greatly affect the performance of our method on a real-world\nproblem. Correlations obviously exist and we discuss how to model them in Section 6.\nBased on the independence assumption, we rewrite the expected gain (Equation 5) as:\n\ngi(y) = pi \u00afgi(y), (7)\n\nwhere:\n\n\u00afgi(y) = E\u03c6[ f (dom (y) \u222a {i} , \u03c6) \u2212 f (dom (y) , \u03c6)| \u03c6 \u223c y, \u03c6[i] = 1 ] (8)\n\nis the expected gain when item i is in state 1. For simplicity of exposition, we assume that the gain\nis zero when the item is in state \u22121. We discuss how to relax this assumption in the Appendix.\nIn general, the gain \u00afgi(y) depends on P (\u03a6) and thus cannot be computed when P (\u03a6) is unknown.\nIn this paper, we assume that \u00afgi(y) can be computed without knowing P (\u03a6). This scenario is quite\ncommon in practice. In maximum coverage problems, for instance, it is quite reasonable to assume\nthat the covered area is only a function of the chosen items and their states. In other words, the gain\ncan be computed as \u00afgi(y) = f (dom (y) \u222a {i} , \u03c6) \u2212 f (dom (y) , \u03c6), where \u03c6 is any state such that\n\u03c6 \u223c y and \u03c6[i] = 1.\nOur learning problem comprises n episodes. 
In episode t, we adaptively choose K items according\nto some policy \u03c0t, which may differ from episode to episode. The quality of the policy is measured\nby the expected cumulative K-step return E\u03c61,...,\u03c6n [\u2211_{t=1}^{n} f (\u03c0t_K(\u03c6t), \u03c6t)]. We compare this return\nto that of the greedy policy \u03c0g and measure the difference between the two returns by the expected\ncumulative regret:\n\nR(n) = E\u03c61,...,\u03c6n [\u2211_{t=1}^{n} Rt(\u03c6t)] = E\u03c61,...,\u03c6n [\u2211_{t=1}^{n} (f (\u03c0g_K(\u03c6t), \u03c6t) \u2212 f (\u03c0t_K(\u03c6t), \u03c6t))] . (9)\n\nIn maximum coverage problems, the greedy policy \u03c0g is a good surrogate for the optimal policy \u03c0\u2217\nbecause it is a (1 \u2212 1/e)-approximation to \u03c0\u2217 (Section 2).\n\nAlgorithm 1 OASM: Optimistic adaptive submodular maximization.\n\nInput: States \u03c61, . . . , \u03c6n\nfor all i \u2208 I do Select item i and set \u02c6pi,1 to its state, Ti(0) \u2190 1 end for \u25b7 Initialization\nfor all t = 1, 2, . . . , n do\n  A \u2190 \u2205\n  for all k = 1, 2, . . . , K do \u25b7 K-step maximization\n    y \u2190 \u03c6t\u27e8A\u27e9\n    A \u2190 A \u222a {arg max_{i \u2208 I \\ A} (\u02c6pi,Ti(t\u22121) + ct\u22121,Ti(t\u22121)) \u00afgi(y)} \u25b7 Choose the highest index\n  end for\n  for all i \u2208 I do Ti(t) \u2190 Ti(t \u2212 1) end for \u25b7 Update statistics\n  for all i \u2208 A do\n    Ti(t) \u2190 Ti(t) + 1\n    \u02c6pi,Ti(t) \u2190 (1 / Ti(t)) (\u02c6pi,Ti(t\u22121) Ti(t \u2212 1) + (1/2) (\u03c6t[i] + 1))\n  end for\nend for\n\n3.2 Algorithm\n\nOur algorithm is designed based on the optimism in the face of uncertainty principle, a strategy that\nis at the core of many bandit algorithms [1, 8, 13]. 
More specifically, it is a greedy policy where the\nexpected gain gi(y) (Equation 7) is replaced by its optimistic estimate. The algorithm adaptively\nmaximizes a submodular function in an optimistic fashion and therefore we refer to it as Optimistic\nAdaptive Submodular Maximization (OASM).\nThe pseudocode of our method is given in Algorithm 1. In each episode, we maximize the function\nf in K steps. At each step, we compute the index (\u02c6pi,Ti(t\u22121) + ct\u22121,Ti(t\u22121)) \u00afgi(y) of each item that\nhas not been selected yet and then choose the item with the highest index. The terms \u02c6pi,Ti(t\u22121) and\nct\u22121,Ti(t\u22121) are the maximum-likelihood estimate of the probability pi from the first t \u2212 1 episodes\nand the radius of the confidence interval around this estimate, respectively. Formally:\n\n\u02c6pi,s = (1 / s) \u2211_{z=1}^{s} (1/2) (\u03c6\u03c4(i,z)[i] + 1), ct,s = \u221a(2 log(t) / s), (10)\n\nwhere s is the number of times that item i is chosen and \u03c4 (i, z) is the index of the episode in which\nitem i is chosen for the z-th time. In episode t, we set s to Ti(t \u2212 1), the number of times that item\ni is selected in the first t \u2212 1 episodes. The radius ct,s is designed such that each index is with high\nprobability an upper bound on the corresponding gain. The index enforces exploration of items that\nhave not been chosen very often. As the number of past episodes increases, all confidence intervals\nshrink and our method starts exploiting the most profitable items. The log(t) term guarantees that each\nitem is explored infinitely often as t \u2192 \u221e, to avoid linear regret.\nAlgorithm OASM has several notable properties. First, it is a greedy method. Therefore, our policies\ncan be computed very fast. Second, it is guaranteed to behave near optimally as our estimates of the\ngain gi(y) become more accurate. We prove this claim in Section 3.3. 
Finally, our algorithm learns\nonly L parameters and therefore is quite practical. Specifically, note that if an item is chosen in one\ncontext, it helps in refining the estimate of the gain gi(y) in all other contexts.\n\n3.3 Analysis\n\nIn this section, we prove an upper bound on the expected cumulative regret of Algorithm OASM in n\nepisodes. Before we present the main result, we define notation used in our analysis. We denote by\ni\u2217(y) = \u03c0g(y) the item chosen by the greedy policy \u03c0g in context y. Without loss of generality, we\nassume that this item is unique in all contexts. The hardness of discriminating between items i and\ni\u2217(y) is measured by a gap between the expected gains of the items:\n\n\u2206i(y) = gi\u2217(y)(y) \u2212 gi(y). (11)\n\nOur analysis is based on counting how many times the policies \u03c0t and \u03c0g choose a different item at\nstep k. Therefore, we define several variables that describe the state of our problem at this step. We\ndenote by Yk(\u03c0) = \u22c3_\u03c6 {\u03c6\u27e8\u03c0k\u22121(\u03c6)\u27e9} the set of all possible observations after policy \u03c0 is executed\nfor k \u2212 1 steps. We write Yk = Yk(\u03c0g) and Yt_k = Yk(\u03c0t) when we refer to the policies \u03c0g and \u03c0t,\nrespectively. Finally, we denote by Yk,i = Yk \u2229 {y : i \u2260 i\u2217(y)} the set of contexts where item i is\nsuboptimal at step k.\nOur main result is Theorem 1. Supplementary material for its proof is in the Appendix. The terms item\nand arm are treated as synonyms, and we use whichever is more appropriate in a given context.\nTheorem 1. 
The expected cumulative regret of Algorithm OASM is bounded as:\n\nR(n) \u2264 \u2211_{i=1}^{L} \u2113i \u2211_{k=1}^{K} Gk \u03b1i,k + (2/3) \u03c0^2 L(L + 1) \u2211_{k=1}^{K} Gk, (12)\n\nwhere the first term is O(log n) and the second term is O(1); Gk = (K \u2212 k + 1) max_{y \u2208 Yk} max_i gi(y) is an upper bound on the expected gain of the policy \u03c0g\nfrom step k forward; \u2113i,k = \u23088 max_{y \u2208 Yk,i} (\u00afgi(y)^2 / \u2206i(y)^2) log n\u2309 is the number of pulls after which arm i is not\nlikely to be pulled suboptimally at step k; \u2113i = max_k \u2113i,k; and \u03b1i,k = (1 / \u2113i) [\u2113i,k \u2212 max_{k\u2032<k} \u2113i,k\u2032]+ \u2208 [0, 1]\nis a weight that associates the regret of arm i to step k such that \u2211_{k=1}^{K} \u03b1i,k = 1.\n\nProof. Our theorem is proved in three steps. First, we associate the regret in episode t with the first\nstep where our policy \u03c0t selects a different item from the greedy policy \u03c0g. For simplicity, suppose\nthat this step is step k. Then the regret in episode t can be written as:\n\nRt(\u03c6t) = f (\u03c0g_K(\u03c6t), \u03c6t) \u2212 f (\u03c0t_K(\u03c6t), \u03c6t)\n= [f (\u03c0g_K(\u03c6t), \u03c6t) \u2212 f (\u03c0g_{k\u22121}(\u03c6t), \u03c6t)] \u2212 [f (\u03c0t_K(\u03c6t), \u03c6t) \u2212 f (\u03c0t_{k\u22121}(\u03c6t), \u03c6t)]\n= Fg_{k\u2192}(\u03c6t) \u2212 Ft_{k\u2192}(\u03c6t), (13)\n\nwhere the second equality is due to the assumption that \u03c0t_[j](\u03c6t) = \u03c0g_[j](\u03c6t) for all j < k; and Fg_{k\u2192}(\u03c6t)\nand Ft_{k\u2192}(\u03c6t), the first and second bracketed terms, are the gains of the policies \u03c0g and \u03c0t, respectively, in state \u03c6t from step k forward.\nIn practice, the first step where the policies \u03c0t and \u03c0g choose a different item is unknown, because\n\u03c0g is unknown. 
In this case, the regret can be written as:\n\nRt(\u03c6t) = \u2211_{i=1}^{L} \u2211_{k=1}^{K} 1i,k,t(\u03c6t)(Fg_{k\u2192}(\u03c6t) \u2212 Ft_{k\u2192}(\u03c6t)), (14)\n\nwhere:\n\n1i,k,t(\u03c6) = 1{(\u2200j < k : \u03c0t_[j](\u03c6) = \u03c0g_[j](\u03c6)), \u03c0t_[k](\u03c6) \u2260 \u03c0g_[k](\u03c6), \u03c0t_[k](\u03c6) = i} (15)\n\nis the indicator of the event that the policies \u03c0t and \u03c0g choose the same first k \u2212 1 items in state \u03c6,\ndisagree in the k-th item, and i is the k-th item chosen by \u03c0t. The commas in the indicator function\nrepresent logical conjunction.\nSecond, in Lemma 1 we bound the expected loss associated with choosing the first different item at\nstep k by the probability of this event and an upper bound on the expected loss Gk, which does not\ndepend on \u03c0t and \u03c6t. Based on this result, we bound the expected cumulative regret as:\n\nE\u03c61,...,\u03c6n [\u2211_{t=1}^{n} Rt(\u03c6t)] = E\u03c61,...,\u03c6n [\u2211_{t=1}^{n} \u2211_{i=1}^{L} \u2211_{k=1}^{K} 1i,k,t(\u03c6t)(Fg_{k\u2192}(\u03c6t) \u2212 Ft_{k\u2192}(\u03c6t))]\n= \u2211_{i=1}^{L} \u2211_{k=1}^{K} \u2211_{t=1}^{n} E\u03c61,...,\u03c6t\u22121[E\u03c6t[1i,k,t(\u03c6t)(Fg_{k\u2192}(\u03c6t) \u2212 Ft_{k\u2192}(\u03c6t))]]\n\u2264 \u2211_{i=1}^{L} \u2211_{k=1}^{K} \u2211_{t=1}^{n} E\u03c61,...,\u03c6t\u22121[E\u03c6t[1i,k,t(\u03c6t)] Gk]\n= \u2211_{i=1}^{L} \u2211_{k=1}^{K} Gk E\u03c61,...,\u03c6n [\u2211_{t=1}^{n} 1i,k,t(\u03c6t)] . (16)\n\nFinally, motivated by the analysis of UCB1 [1], we rewrite the indicator 1i,k,t(\u03c6t) as:\n\n1i,k,t(\u03c6t) = 
1i,k,t(\u03c6t)1{Ti(t \u2212 1) \u2264 \u2113i,k} + 1i,k,t(\u03c6t)1{Ti(t \u2212 1) > \u2113i,k} , (17)\n\nwhere \u2113i,k is a problem-specific constant. In Lemma 4, we show how to choose \u2113i,k such that arm i\nat step k is pulled suboptimally a constant number of times in expectation after \u2113i,k pulls. Based on\nthis result, the regret corresponding to the events 1{Ti(t \u2212 1) > \u2113i,k} is bounded as:\n\n\u2211_{i=1}^{L} \u2211_{k=1}^{K} Gk E\u03c61,...,\u03c6n [\u2211_{t=1}^{n} 1i,k,t(\u03c6t)1{Ti(t \u2212 1) > \u2113i,k}] \u2264 (2/3) \u03c0^2 L(L + 1) \u2211_{k=1}^{K} Gk. (18)\n\nOn the other hand, the regret associated with the events 1{Ti(t \u2212 1) \u2264 \u2113i,k} is trivially bounded by\n\u2211_{i=1}^{L} \u2211_{k=1}^{K} Gk \u2113i,k. A tighter upper bound is proved below:\n\n\u2211_{i=1}^{L} E\u03c61,...,\u03c6n [\u2211_{k=1}^{K} Gk \u2211_{t=1}^{n} 1i,k,t(\u03c6t)1{Ti(t \u2212 1) \u2264 \u2113i,k}]\n\u2264 \u2211_{i=1}^{L} max_{\u03c61,...,\u03c6n} [\u2211_{k=1}^{K} Gk \u2211_{t=1}^{n} 1i,k,t(\u03c6t)1{Ti(t \u2212 1) \u2264 \u2113i,k}]\n\u2264 \u2211_{i=1}^{L} \u2211_{k=1}^{K} Gk [\u2113i,k \u2212 max_{k\u2032<k} \u2113i,k\u2032]+ . (19)\n\nThe last inequality can be proved as follows. Our upper bound on the expected loss at step k, Gk, is\nmonotonically decreasing with k, and therefore G1 \u2265 G2 \u2265 . . . \u2265 GK. So for any given arm i, the\nhighest cumulative regret subject to the constraint Ti(t \u2212 1) \u2264 \u2113i,k at step k is achieved as follows.\nThe first \u2113i,1 mistakes are made at the first step, [\u2113i,2 \u2212 \u2113i,1]+ mistakes are made at the second step,\n[\u2113i,3 \u2212 max{\u2113i,1, \u2113i,2}]+ mistakes are made at the third step, and so on. 
Specifically, the number of\nmistakes at step k is [\u2113i,k \u2212 max_{k\u2032<k} \u2113i,k\u2032]+ and the associated loss is Gk.\nOur main claim follows from combining the upper bounds in Equations 18 and 19.\n\n3.4 Discussion of Theoretical Results\n\nAlgorithm OASM mimics the greedy policy \u03c0g. Therefore, we decided to prove Theorem 1 based on\ncounting how many times the policies \u03c0t and \u03c0g choose a different item. Our proof has three parts.\nFirst, we associate the regret in episode t with the first step where the policy \u03c0t chooses a different\nitem from \u03c0g. Second, we bound the expected regret in each episode by the probability of deviating\nfrom the policy \u03c0g at step k and an upper bound on the associated loss Gk, which depends only on\nk. Finally, we divide the expected cumulative regret into two terms, before and after item i at step k\nis selected a sufficient number of times \u2113i,k, and then set \u2113i,k such that both terms are O(log n). We\nwould like to stress that our proof is relatively general. Our modeling assumptions (Section 3.1) are\nleveraged only in Lemma 4. In the rest of the proof, we only assume that f is adaptive submodular\nand adaptive monotonic.\nOur regret bound has several notable properties. First, it is logarithmic in the number of episodes n,\nthrough problem-specific constants \u2113i,k. So we recover a classical result from the bandit literature.\nSecond, the bound is polynomial in all constants of interest, such as the number of items L and the\nnumber of maximization steps K in each episode. We would like to stress that it is not linear in the\nnumber of contexts YK at step K, which is exponential in K. Finally, note that our bound captures\nthe shape of the optimized function f. In particular, because the function f is adaptive submodular,\nthe upper bound on the gain of the policy \u03c0g from step k forward, Gk, decreases as k increases. 
As\na result, earlier deviations from \u03c0g are penalized more than later ones.\n\n4 Experiments\n\nOur algorithm is evaluated on a preference elicitation problem in a movie recommendation domain.\nThis problem is cast as asking K yes-or-no movie-genre questions. The users and their preferences\nare extracted from the MovieLens dataset [11], a dataset of 6k users who provided one million movie ratings.\n\nGenre       gi(0)   \u00afgi(0)   P (\u03c6[i] = 1)\nCrime       4.1%    13.0%    0.32\nChildren\u2019s  4.1%    9.2%     0.44\nAnimation   3.2%    6.6%     0.48\nHorror      3.0%    8.0%     0.38\nSci-Fi      2.8%    23.0%    0.12\nMusical     2.6%    6.0%     0.44\nFantasy     2.6%    5.8%     0.44\nAdventure   2.3%    19.6%    0.12\n\n[Figure 1, right panel: covered movies [%] against the number of questions K for the policies \u03c0g, deterministic \u03c0g_d, and factored \u03c0g_f.]\n\nFigure 1: Left. Eight movie genres that cover the largest number of movies in expectation. Right.\nComparison of three greedy policies for solving our preference elicitation problem. For each policy\nand K \u2264 L, we report the expected percentage of covered movies after K questions.\n\n[Figure 2: three panels, K = 2, K = 4, and K = 8, each plotting covered movies [%] against episode t.]\n\nFigure 2: The expected return of the OASM policy \u03c0t (cyan lines) in all episodes up to t = 10^5. The\nreturn is compared to those of the greedy policies \u03c0g (blue lines), \u03c0g_f (red lines), and \u03c0g_d (gray lines)\nin the offline setting (Figure 1) at the same operating point, the number of asked questions K.\n\nWe choose the 500 most rated movies from the dataset. Each movie l is represented by a feature vector\nxl such that xl[i] = 1 if the movie belongs to genre i and xl[i] = 0 if it does not. The preference of\nuser j for genre i is measured by tf-idf, a popular importance score in information retrieval [12]. In\nparticular, it is defined as tf-idf(j, i) = #(j, i) log(nu / #(\u00b7, i)), where #(j, i) is the number of movies\nfrom genre i rated by user j, nu is the number of users, and #(\u00b7, i) is the number of users that rated\nat least one movie from genre i. Intuitively, this score prefers genres that are often rated by the user\nbut rarely rated overall. 
Each user j is represented by a genre preference vector \u03c6 such that \u03c6[i] = 1\nwhen genre i is among the five most favorite genres of the user. These genres cover on average 25% of\nour movies. In Figure 1, we show several popular genres from our dataset.\nThe reward for asking user \u03c6 questions A is:\n\nf (A, \u03c6) = (1/5) \u2211_{l=1}^{500} max_i [xl[i] 1{\u03c6[i] = 1} 1{i \u2208 A}] , (20)\n\nthe percentage of movies that belong to at least one genre i that is preferred by the user and queried\nin A. The function f captures the notion that knowing more preferred genres is better than knowing\nfewer. It is submodular in A for any given preference vector \u03c6, and therefore adaptive submodular in\nA when the preferences are distributed independently of each other (Equation 6). In this setting, the\nexpected value of f can be maximized near optimally by a greedy policy (Equation 4).\nIn the first experiment, we show that our assumption on P (\u03a6) (Equation 6) is not very restrictive in\nour domain. We compare three greedy policies for maximizing f that know P (\u03a6) and differ in how\nthe expected gain of choosing items is estimated. The first policy \u03c0g makes no assumption on P (\u03a6)\nand computes the gain according to Equation 5. The second policy \u03c0g_f assumes that the distribution\nP (\u03a6) is factored and computes the gain using Equation 7. Finally, the third policy \u03c0g_d computes the\ngain according to Equation 8, essentially ignoring the stochasticity of our problem. All policies are\napplied to all users in our dataset for all K \u2264 L and their expected returns are reported in Figure 1.\nWe observe two trends. First, the policy \u03c0g_f usually outperforms the policy \u03c0g_d by a large margin. 
So\nalthough our independence assumption may be incorrect, it is a better approximation than ignoring\nthe stochastic nature of the problem. Second, the expected return of \u03c0g_f is always within 84% of \u03c0g.\nWe conclude that \u03c0g_f is a good approximation to \u03c0g.\nIn the second experiment, we study how the OASM policy \u03c0t improves over time. In each episode t,\nwe randomly choose a new user \u03c6t and then the policy \u03c0t asks K questions. The expected return of\n\u03c0t is compared to two offline baselines, \u03c0g_f and \u03c0g_d. The policies \u03c0g_f and \u03c0g_d can be viewed as upper\nand lower bounds on the expected return of \u03c0t, respectively. Our results are shown in Figure 2. We\nobserve two major trends. First, \u03c0t easily outperforms the baseline \u03c0g_d that ignores the stochasticity\nof our problem. In two cases, this happens in less than ten episodes. Second, the expected return of\n\u03c0t approaches that of \u03c0g_f , as is expected based on our analysis.\n\n5 Related Work\n\nOur paper is motivated by prior work in the areas of submodularity [14, 5] and bandits [1]. Similar\nproblems to ours were studied by several authors. For instance, Yue and Guestrin [17], and Guillory\nand Bilmes [6], applied bandits to submodular problems in a non-adaptive setting. In our work, we\nfocus on the adaptive setting. This setting is more challenging because we learn a K-step policy for\nchoosing items, as opposed to a single set of items. Wen et al. 
[16] studied a variant of generalized\nbinary search, sequential Bayesian search, where the policy for asking questions is learned\non-the-fly by interacting with the environment. A major observation of Wen et al. [16] is that this problem\ncan be solved near optimally without exploring. As a result, its solution and analysis are completely\ndifferent from those in our paper.\nLearning with trees was studied in machine learning in many settings, such as online learning with\ntree experts [3]. This work is similar to ours only in trying to learn a tree. The notions of regret and\nthe assumptions on solved problems are completely different. Optimism in the face of uncertainty is\na popular approach to designing learning algorithms, and it was previously applied to more general\nproblems than ours, such as planning [13] and MDPs [8]. Both of these solutions are impractical in\nour setting. The former assumes that the model of the world is known and the latter is\ncomputationally intractable.\n\n6 Conclusions\n\nThis is the first work that studies adaptive submodular maximization in the setting where the model\nof the world is initially unknown. We propose an efficient bandit algorithm for solving the problem\nand prove that its expected cumulative regret increases logarithmically with time. Our work can be\nviewed as reinforcement learning (RL) [15] for adaptive submodularity. The main difference in our\nsetting is that we can learn near-optimal policies without estimating the value function. Learning of\nvalue functions is typically hard, even when the model of the problem is known. Fortunately, this is\nnot necessary in our problem and therefore we can develop a very efficient learning algorithm.\nWe assume that the states of items are distributed independently of each other. In our experiments,\nthis assumption was less restrictive than we expected (Section 4). 
Nevertheless, we believe that our\napproach should be studied under less restrictive assumptions. In preference elicitation (Section 4),\nfor instance, the answers to questions are likely to be correlated due to many factors, such as the user\u2019s\npreferences, the user\u2019s mood, and the similarity of the questions. Our current model cannot capture any\nof these dependencies. However, we believe that our approach is quite general and can be extended\nto more complex models. We think that any such generalization would comprise three major steps:\nchoosing a model of P(\u03a6), deriving a corresponding upper confidence bound on the expected gain,\nand finally proving an equivalent of Lemma 4.\nWe also assume that the expected gain of choosing an item (Equation 7) can be written as a product\nof some known gain function (Equation 8) and the probability of the item\u2019s states. This assumption\nis quite natural in maximum coverage problems but may not be appropriate in other problems, such\nas generalized binary search [4].\nOur upper bound on the expected regret at step k (Lemma 1) may be loose in practice because it is\nobtained by maximizing over all contexts y \u2208 Y_k. In general, it is difficult to prove a tighter bound.\nSuch a bound would have to depend on the probability of making a mistake in a specific context at\nstep k, which depends on the policy in that episode, and indirectly on the progress of learning in all\nearlier episodes. We leave this for future work.\n\nReferences\n\n[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235\u2013256, 2002.\n\n[2] Sandilya Bhamidipati, Branislav Kveton, and S. Muthukrishnan. Minimal interaction search: Multi-way search with item categories. In Proceedings of AAAI Workshop on Intelligent Techniques for Web Personalization and Recommendation, 2013.\n\n[3] Nicolo Cesa-Bianchi and Gabor Lugosi. 
Prediction, Learning, and Games. Cambridge University Press, New York, NY, 2006.\n\n[4] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems 17, pages 337\u2013344, 2005.\n\n[5] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427\u2013486, 2011.\n\n[6] Andrew Guillory and Jeff Bilmes. Online submodular set cover, ranking, and repeated active learning. In Advances in Neural Information Processing Systems 24, pages 1107\u20131115, 2011.\n\n[7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In Proceedings of the 28th International Conference on Machine Learning, pages 369\u2013376, 2011.\n\n[8] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563\u20131600, 2010.\n\n[9] David Kempe, Jon Kleinberg, and \u00c9va Tardos. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137\u2013146, 2003.\n\n[10] Andreas Krause, Ajit Paul Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9:235\u2013284, 2008.\n\n[11] Shyong Lam and Jon Herlocker. MovieLens 1M Dataset. http://www.grouplens.org/node/12, 2012.\n\n[12] Christopher Manning, Prabhakar Raghavan, and Hinrich Sch\u00fctze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, 2008.\n\n[13] R\u00e9mi Munos. The optimistic principle applied to games, optimization, and planning: Towards foundations of Monte-Carlo tree search. Foundations and Trends in Machine Learning, 2012.\n\n[14] G. L. 
Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265\u2013294, 1978.\n\n[15] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.\n\n[16] Zheng Wen, Branislav Kveton, Brian Eriksson, and Sandilya Bhamidipati. Sequential Bayesian search. In Proceedings of the 30th International Conference on Machine Learning, pages 977\u2013983, 2013.\n\n[17] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems 24, pages 2483\u20132491, 2011.\n", "award": [], "sourceid": 1260, "authors": [{"given_name": "Victor", "family_name": "Gabillon", "institution": "INRIA"}, {"given_name": "Branislav", "family_name": "Kveton", "institution": "Technicolor Labs"}, {"given_name": "Zheng", "family_name": "Wen", "institution": "Stanford University"}, {"given_name": "Brian", "family_name": "Eriksson", "institution": "Technicolor Labs"}, {"given_name": "S.", "family_name": "Muthukrishnan", "institution": "Rutgers University"}]}