{"title": "Adaptive Submodular Maximization in Bandit Setting", "book": "Advances in Neural Information Processing Systems", "page_first": 2697, "page_last": 2705, "abstract": "Maximization of submodular functions has wide applications in machine learning and artificial intelligence. Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the scenario where the expected gain is initially unknown and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures the inherent property of submodular maximization, earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem.", "full_text": "Adaptive Submodular Maximization in Bandit Setting\n\nVictor Gabillon\n\nINRIA Lille - team SequeL\nVilleneuve d\u2019Ascq, France\nvictor.gabillon@inria.fr\n\nBranislav Kveton\nTechnicolor Labs\n\nPalo Alto, CA\n\nbranislav.kveton@technicolor.com\n\nZheng Wen\n\nElectrical Engineering Department\n\nStanford University\n\nzhengwen@stanford.edu\n\nBrian Eriksson\nTechnicolor Labs\n\nPalo Alto, CA\n\nbrian.eriksson@technicolor.com\n\nS. Muthukrishnan\n\nDepartment of Computer Science\n\nRutgers\n\nmuthu@cs.rutgers.edu\n\nAbstract\n\nMaximization of submodular functions has wide applications in machine learning\nand arti\ufb01cial intelligence. 
Adaptive submodular maximization has been traditionally studied under the assumption that the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. In this paper, we study the setting where the expected gain is initially unknown, and it is learned by interacting repeatedly with the optimized function. We propose an efficient algorithm for solving our problem and prove that its expected cumulative regret increases logarithmically with time. Our regret bound captures an inherent property of submodular maximization: earlier mistakes are more costly than later ones. We refer to our approach as Optimistic Adaptive Submodular Maximization (OASM) because it trades off exploration and exploitation based on the optimism in the face of uncertainty principle. We evaluate our method on a preference elicitation problem and show that non-trivial K-step policies can be learned from just a few hundred interactions with the problem.

1 Introduction

Maximization of submodular functions [14] has wide applications in machine learning and artificial intelligence, such as social network analysis [9], sensor placement [10], and recommender systems [7, 2]. In this paper, we study the problem of adaptive submodular maximization [5]. This problem is a variant of submodular maximization where each item has a state, and this state is revealed when the item is chosen. The goal is to learn a policy that maximizes the expected return for choosing K items.

Adaptive submodular maximization has been traditionally studied in the setting where the model of the world, the expected gain of choosing an item given previously selected items and their states, is known. This is the first paper that studies the setting where the model is initially unknown and is learned by interacting repeatedly with the environment.
We bring together the concepts of adaptive submodular maximization and bandits, and the result is an efficient solution to our problem.

We make four major contributions. First, we propose a model where the expected gain of choosing an item can be learned efficiently. The main assumption in the model is that the state of each item is distributed independently of the other states. Second, we propose Optimistic Adaptive Submodular Maximization (OASM), a bandit algorithm that selects items with the highest upper confidence bound on the expected gain. This algorithm is computationally efficient and easy to implement. Third, we prove that the expected cumulative regret of our algorithm increases logarithmically with time. Our regret bound captures an inherent property of adaptive submodular maximization: earlier mistakes are more costly than later ones. Finally, we apply our approach to a real-world preference elicitation problem and show that non-trivial policies can be learned from just a few hundred interactions with the problem.

2 Adaptive Submodularity

In adaptive submodular maximization, the objective is to maximize, under constraints, a function of the form:

f : 2^I × {−1, 1}^L → R,  (1)

where I = {1, ..., L} is a set of L items and 2^I is its power set. The first argument of f is a subset of chosen items A ⊆ I. The second argument is the state φ ∈ {−1, 1}^L of all items. The i-th entry of φ, φ[i], is the state of item i. The state φ is drawn i.i.d. from some probability distribution P(Φ). The reward for choosing items A in state φ is f(A, φ). For simplicity of exposition, we assume that f(∅, φ) = 0 in all φ. In problems of our interest, the state is only partially observed. To capture this phenomenon, we introduce the notion of observations.
An observation is a vector y ∈ {−1, 0, 1}^L whose non-zero entries are the observed states of items. We say that y is an observation of state φ, and write φ ∼ y, if y[i] = φ[i] in all non-zero entries of y. Alternatively, the state φ can be viewed as a realization of y, one of many. We denote by dom(y) = {i : y[i] ≠ 0} the observed items in y and by φ⟨A⟩ the observation of items A in state φ. We define a partial ordering on observations and write y′ ⪰ y if y′[i] = y[i] in all non-zero entries of y; that is, y′ is a more specific observation than y. In the terminology of Golovin and Krause [5], y is a subrealization of y′.

We illustrate our notation on a simple example. Let φ = (1, 1, −1) be a state, and y1 = (1, 0, 0) and y2 = (1, 0, −1) be observations. Then all of the following claims are true:

φ ∼ y1,  φ ∼ y2,  y2 ⪰ y1,  dom(y2) = {1, 3},  φ⟨{1, 3}⟩ = y2,  φ⟨dom(y1)⟩ = y1.

Our goal is to maximize the expected value of f by adaptively choosing K items. This problem can be viewed as a K-step game, where at each step we choose an item according to some policy π and then observe its state. A policy π : {−1, 0, 1}^L → I is a function from observations y to items. The observations represent our past decisions and their outcomes. A k-step policy in state φ, π_k(φ), is the collection of the first k items chosen by policy π. The policy is defined recursively as:

π_k(φ) = π_{k−1}(φ) ∪ {π^[k](φ)},    π^[k](φ) = π(φ⟨π_{k−1}(φ)⟩),    π_0(φ) = ∅,  (2)

where π^[k](φ) is the k-th item chosen by policy π in state φ.
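The observation notation and the recursion in Equation 2 can be illustrated with a short sketch (a toy example of our own; the paper's items are 1-based, while the code below uses 0-based indices):

```python
# Illustrative sketch, not from the paper: states and observations are
# vectors over L items; a 0 entry in an observation marks an unobserved item.
from typing import Callable, List

def dom(y: List[int]) -> List[int]:
    """Observed items in y: indices with non-zero entries."""
    return [i for i, s in enumerate(y) if s != 0]

def consistent(phi: List[int], y: List[int]) -> bool:
    """phi ~ y: y agrees with phi on all of y's observed entries."""
    return all(phi[i] == y[i] for i in dom(y))

def run_policy(policy: Callable[[List[int]], int], phi: List[int], k: int) -> List[int]:
    """Compute pi_k(phi) by the recursion in Equation 2: repeatedly apply
    the policy to the observation of the items chosen so far."""
    y = [0] * len(phi)      # pi_0(phi) = empty set, nothing observed
    chosen = []
    for _ in range(k):
        i = policy(y)       # pi^[k](phi) = pi(phi<pi_{k-1}(phi)>)
        chosen.append(i)
        y[i] = phi[i]       # the chosen item's state is revealed
    return chosen

# The worked example from the text, shifted to 0-based indices:
phi = [1, 1, -1]
y2 = [1, 0, -1]
assert consistent(phi, y2) and dom(y2) == [0, 2]
```

Any function from observations to unchosen items can be plugged in as `policy`; the greedy policy of the next section is one such choice.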
The optimal K-step policy satisfies:

π* = argmax_π E_φ[f(π_K(φ), φ)].  (3)

In general, the problem of computing π* is NP-hard [14, 5]. However, near-optimal policies can be computed efficiently when the maximized function has a diminishing return property. Formally, we require that the function is adaptive submodular and adaptive monotonic [5].

Definition 1. Function f is adaptive submodular if:

E_φ[ f(A ∪ {i}, φ) − f(A, φ) | φ ∼ y_A ] ≥ E_φ[ f(B ∪ {i}, φ) − f(B, φ) | φ ∼ y_B ]

for all items i ∈ I \ B and observations y_B ⪰ y_A, where A = dom(y_A) and B = dom(y_B).

Definition 2. Function f is adaptive monotonic if E_φ[ f(A ∪ {i}, φ) − f(A, φ) | φ ∼ y_A ] ≥ 0 for all items i ∈ I \ A and observations y_A, where A = dom(y_A).

In other words, the expected gain of choosing an item is always non-negative and does not increase as the observations become more specific.

Let π^g be the greedy policy for maximizing f, a policy that always selects the item with the highest expected gain:

π^g(y) = argmax_{i ∈ I \ dom(y)} g_i(y),  (4)

where:

g_i(y) = E_φ[ f(dom(y) ∪ {i}, φ) − f(dom(y), φ) | φ ∼ y ]  (5)

is the expected gain of choosing item i after observing y. Then, based on the result of Golovin and Krause [5], π^g is a (1 − 1/e)-approximation to π*, E_φ[f(π^g_K(φ), φ)] ≥ (1 − 1/e) E_φ[f(π*_K(φ), φ)], if f is adaptive submodular and adaptive monotonic. In the rest of this paper, we say that an observation y is a context if it can be observed under the greedy policy π^g.
Specifically, there exist k and φ such that y = φ⟨π^g_k(φ)⟩.

3 Adaptive Submodularity in Bandit Setting

The greedy policy π^g can be computed only if the objective function f and the distribution of states P(Φ) are known, because both of these quantities are needed to compute the marginal benefit g_i(y) (Equation 5). In practice, the distribution P(Φ) is often unknown, for instance in a newly deployed sensor network where the failure rates of the sensors are unknown. In this paper, we study a natural variant of adaptive submodular maximization that can model such problems. The distribution P(Φ) is assumed to be unknown, and we learn it by interacting repeatedly with the problem.

3.1 Model

The problem of learning P(Φ) can be cast in many ways. One approach is to directly learn the joint distribution P(Φ). This approach is not practical for two reasons. First, the number of states φ is exponential in the number of items L. Second, the state of our problem is observed only partially. As a result, it is generally impossible to identify the distribution that generates φ. Another possibility is to learn the probability of individual states φ[i] conditioned on the context, that is, on the observations y that arise under the greedy policy π^g in up to K steps. This is impractical because the number of contexts is exponential in K.

Clearly, additional structural assumptions are necessary to obtain a practical solution. In this paper, we assume that the states of items are independent of the context in which the items are chosen. In particular, the state φ[i] of each item i is drawn i.i.d. from a Bernoulli distribution with mean p_i.
In this setting, the joint probability distribution factors as:

P(Φ = φ) = ∏_{i=1}^L p_i^{1{φ[i]=1}} (1 − p_i)^{1 − 1{φ[i]=1}},  (6)

and the problem of learning P(Φ) reduces to estimating L parameters, the means of the Bernoullis.

A major question is how restrictive our independence assumption is. We argue that this assumption is fairly natural in many applications. For instance, consider a sensor network where the sensors fail at random due to manufacturing defects. The failures of these sensors are independent of each other and thus can be modeled in our framework. To validate our assumption, we conduct an experiment (Section 4) that shows that it does not greatly affect the performance of our method on a real-world problem. Correlations obviously exist, and we discuss how to model them in Section 6.

Based on the independence assumption, we rewrite the expected gain (Equation 5) as:

g_i(y) = p_i ḡ_i(y),  (7)

where:

ḡ_i(y) = E_φ[ f(dom(y) ∪ {i}, φ) − f(dom(y), φ) | φ ∼ y, φ[i] = 1 ]  (8)

is the expected gain when item i is in state 1. For simplicity of exposition, we assume that the gain is zero when the item is in state −1. We discuss how to relax this assumption in the Appendix.

In general, the gain ḡ_i(y) depends on P(Φ) and thus cannot be computed when P(Φ) is unknown. In this paper, we assume that ḡ_i(y) can be computed without knowing P(Φ). This scenario is quite common in practice. In maximum coverage problems, for instance, it is quite reasonable to assume that the covered area is only a function of the chosen items and their states. In other words, the gain can be computed as ḡ_i(y) = f(dom(y) ∪ {i}, φ) − f(dom(y), φ), where φ is any state such that φ ∼ y and φ[i] = 1.

Our learning problem comprises n episodes.
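Before turning to the episodes, the factored model and the gain decomposition g_i(y) = p_i ḡ_i(y) (Equations 6-8) can be made concrete on a small maximum coverage instance; the items, regions, and probabilities below are our own illustration, not from the paper:

```python
# Hypothetical coverage instance (our own numbers): item i in state 1
# covers the regions in cover[i]; a failed item (state -1) covers nothing,
# so its gain in state -1 is zero, as assumed in the text.
cover = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}}
p = {0: 0.9, 1: 0.5, 2: 0.8}   # Bernoulli means P(phi[i] = 1)

def f(A, phi):
    """Reward: number of regions covered by working chosen items."""
    covered = set()
    for i in A:
        if phi[i] == 1:
            covered |= cover[i]
    return len(covered)

def g_bar(i, y):
    """State-1 gain (Equation 8): computable from any phi ~ y with
    phi[i] = 1, because coverage depends only on chosen items' states."""
    A = [j for j, s in enumerate(y) if s != 0]
    phi = {j: (y[j] if j in A else 1) for j in cover}  # any completion works
    phi[i] = 1
    return f(A + [i], phi) - f(A, phi)

def g(i, y):
    """Expected gain (Equation 7): g_i(y) = p_i * g_bar_i(y)."""
    return p[i] * g_bar(i, y)

# After observing item 0 working, item 1 only adds region "c":
y = [1, 0, 0]
assert g_bar(1, y) == 1 and g(1, y) == 0.5
```

Note that `g_bar` never touches `p`, which is what lets the algorithm of the next section compute ḡ_i(y) while P(Φ) is still being learned.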
In episode t, we adaptively choose K items according to some policy π^t, which may differ from episode to episode. The quality of the policy is measured by the expected cumulative K-step return E_{φ_1,...,φ_n}[∑_{t=1}^n f(π^t_K(φ_t), φ_t)]. We compare this return to that of the greedy policy π^g and measure the difference between the two returns by the expected cumulative regret:

R(n) = E_{φ_1,...,φ_n}[ ∑_{t=1}^n R_t(φ_t) ] = E_{φ_1,...,φ_n}[ ∑_{t=1}^n (f(π^g_K(φ_t), φ_t) − f(π^t_K(φ_t), φ_t)) ].  (9)

In maximum coverage problems, the greedy policy π^g is a good surrogate for the optimal policy π* because it is a (1 − 1/e)-approximation to π* (Section 2).

Algorithm 1 OASM: Optimistic adaptive submodular maximization.
  Input: States φ_1, ..., φ_n
  for all i ∈ I do select item i and set p̂_{i,1} to its state, T_i(0) ← 1 end for    ▷ Initialization
  for all t = 1, 2, ..., n do
    A ← ∅
    for all k = 1, 2, ..., K do    ▷ K-step maximization
      y ← φ_t⟨A⟩
      A ← A ∪ { argmax_{i ∈ I \ A} (p̂_{i,T_i(t−1)} + c_{t−1,T_i(t−1)}) ḡ_i(y) }    ▷ Choose the highest index
    end for
    for all i ∈ I do T_i(t) ← T_i(t − 1) end for    ▷ Update statistics
    for all i ∈ A do
      T_i(t) ← T_i(t) + 1
      p̂_{i,T_i(t)} ← (1 / T_i(t)) (p̂_{i,T_i(t−1)} T_i(t − 1) + (φ_t[i] + 1) / 2)
    end for
  end for

3.2 Algorithm

Our algorithm is designed based on the optimism in the face of uncertainty principle, a strategy that is at the core of many bandit algorithms [1, 8, 13].
More specifically, it is a greedy policy where the expected gain g_i(y) (Equation 7) is replaced by an optimistic estimate of it. The algorithm adaptively maximizes a submodular function in an optimistic fashion, and therefore we refer to it as Optimistic Adaptive Submodular Maximization (OASM).

The pseudocode of our method is given in Algorithm 1. In each episode, we maximize the function f in K steps. At each step, we compute the index (p̂_{i,T_i(t−1)} + c_{t−1,T_i(t−1)}) ḡ_i(y) of each item that has not been selected yet and then choose the item with the highest index. The terms p̂_{i,T_i(t−1)} and c_{t−1,T_i(t−1)} are the maximum-likelihood estimate of the probability p_i from the first t − 1 episodes and the radius of the confidence interval around this estimate, respectively. Formally:

p̂_{i,s} = (1/s) ∑_{z=1}^s (1/2)(φ_{τ(i,z)}[i] + 1),    c_{t,s} = √(2 log(t) / s),  (10)

where s is the number of times that item i is chosen and τ(i, z) is the index of the episode in which item i is chosen for the z-th time. In episode t, we set s to T_i(t − 1), the number of times that item i is selected in the first t − 1 episodes. The radius c_{t,s} is designed such that each index is, with high probability, an upper bound on the corresponding gain. The index enforces exploration of items that have not been chosen very often. As the number of past episodes increases, all confidence intervals shrink and our method starts exploiting the most profitable items. The log(t) term guarantees that each item is explored infinitely often as t → ∞, to avoid linear regret.

Algorithm OASM has several notable properties. First, it is a greedy method. Therefore, our policies can be computed very fast. Second, it is guaranteed to behave near optimally as our estimates of the gain g_i(y) become more accurate. We prove this claim in Section 3.3.
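For concreteness, one episode of the selection and update rules above can be sketched as follows (a simplified sketch with our own naming; the confidence radius is guarded at t = 1, where the paper instead initializes by selecting every item once):

```python
import math

def oasm_episode(t, K, items, p_hat, T, g_bar, observe):
    """One episode of the OASM selection and update rules (a sketch):
    pick K items by the highest optimistic index, then update the
    empirical means and pull counts of the chosen items."""
    A, y = [], {i: 0 for i in items}          # nothing observed yet
    for _ in range(K):
        def index(i):
            # Optimistic index: (empirical mean + confidence radius)
            # times the state-1 gain, as in Equation 10. The max(...)
            # guard avoids log(0) in the very first episode.
            c = math.sqrt(2.0 * math.log(max(t - 1, 1)) / T[i])
            return (p_hat[i] + c) * g_bar(i, y)
        i = max((j for j in items if j not in A), key=index)
        A.append(i)
        y[i] = observe(i)                     # state revealed: +1 or -1
    for i in A:                               # update statistics
        p_hat[i] = (p_hat[i] * T[i] + (y[i] + 1) / 2) / (T[i] + 1)
        T[i] += 1
    return A
```

Here `g_bar(i, y)` plays the role of ḡ_i(y) and `observe` stands in for reading the state φ_t[i]; both names are ours, not the paper's.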
Finally, our algorithm learns only L parameters and therefore is quite practical. Specifically, note that if an item is chosen in one context, it helps in refining the estimate of the gain g_i(y) in all other contexts.

3.3 Analysis

In this section, we prove an upper bound on the expected cumulative regret of Algorithm OASM in n episodes. Before we present the main result, we define notation used in our analysis. We denote by i*(y) = π^g(y) the item chosen by the greedy policy π^g in context y. Without loss of generality, we assume that this item is unique in all contexts. The hardness of discriminating between items i and i*(y) is measured by the gap between the expected gains of the items:

Δ_i(y) = g_{i*(y)}(y) − g_i(y).  (11)

Our analysis is based on counting how many times the policies π^t and π^g choose a different item at step k. Therefore, we define several variables that describe the state of our problem at this step. We denote by Y_k(π) = ∪_φ {φ⟨π_{k−1}(φ)⟩} the set of all possible observations after policy π is executed for k − 1 steps. We write Y_k = Y_k(π^g) and Y^t_k = Y_k(π^t) when we refer to the policies π^g and π^t, respectively. Finally, we denote by Y_{k,i} = Y_k ∩ {y : i ≠ i*(y)} the set of contexts where item i is suboptimal at step k.

Our main result is Theorem 1. Supplementary material for its proof is in the Appendix. The terms item and arm are treated as synonyms, and we use whichever is more appropriate in a given context.

Theorem 1.
The expected cumulative regret of Algorithm OASM is bounded as:

R(n) ≤ ∑_{i=1}^L ∑_{k=1}^K G_k α_{i,k} ⌈ℓ_i⌉ + (2/3) π² L(L + 1) ∑_{k=1}^K G_k,  (12)

where the first term is O(log n) and the second term is O(1), G_k = (K − k + 1) max_{y ∈ Y_k} g_{i*(y)}(y) is an upper bound on the expected gain of the policy π^g from step k forward, ℓ_{i,k} = 8 max_{y ∈ Y_{k,i}} (ḡ_i²(y) / Δ_i²(y)) log n is the number of pulls after which arm i is not likely to be pulled suboptimally at step k, ℓ_i = max_k ℓ_{i,k}, and α_{i,k} = (1/ℓ_i) [ℓ_{i,k} − max_{k′ < k} ℓ_{i,k′}]⁺ is a weight that associates the regret of arm i to step k such that ∑_{k=1}^K α_{i,k} ≤ 1.

In the proof, the event that arm i is pulled suboptimally at step k of episode t, denoted 1_{i,k,t}(φ_t), is decomposed as:

1_{i,k,t}(φ_t) = 1_{i,k,t}(φ_t) 1{T_i(t − 1) ≤ ℓ_{i,k}} + 1_{i,k,t}(φ_t) 1{T_i(t − 1) > ℓ_{i,k}},  (17)

where ℓ_{i,k} is a problem-specific constant. In Lemma 4, we show how to choose ℓ_{i,k} such that arm i at step k is pulled suboptimally a constant number of times in expectation after ℓ_{i,k} pulls. Based on this result, the regret corresponding to the events 1{T_i(t − 1) > ℓ_{i,k}} is bounded as:

∑_{i=1}^L ∑_{k=1}^K G_k E_{φ_1,...,φ_n}[ ∑_{t=1}^n 1_{i,k,t}(φ_t) 1{T_i(t − 1) > ℓ_{i,k}} ] ≤ (2/3) π² L(L + 1) ∑_{k=1}^K G_k.  (18)

On the other hand, the regret associated with the events 1{T_i(t − 1) ≤ ℓ_{i,k}} is trivially bounded by ∑_{i=1}^L ∑_{k=1}^K G_k ℓ_{i,k}.
A tighter upper bound is proved below:

∑_{i=1}^L E_{φ_1,...,φ_n}[ ∑_{t=1}^n ∑_{k=1}^K G_k 1_{i,k,t}(φ_t) 1{T_i(t − 1) ≤ ℓ_{i,k}} ]
≤ ∑_{i=1}^L max_{φ_1,...,φ_n} ∑_{t=1}^n ∑_{k=1}^K G_k 1_{i,k,t}(φ_t) 1{T_i(t − 1) ≤ ℓ_{i,k}}
≤ ∑_{i=1}^L ∑_{k=1}^K G_k [ℓ_{i,k} − max_{k′ < k}