{"title": "Adaptive Sequence Submodularity", "book": "Advances in Neural Information Processing Systems", "page_first": 5352, "page_last": 5363, "abstract": "In many machine learning applications, one needs to interactively select a sequence of items (e.g., recommending movies based on a user's feedback) or make sequential decisions in a certain order (e.g., guiding an agent through a series of states). Not only do sequences already pose a dauntingly large search space, but we must also take into account past observations, as well as the uncertainty of future outcomes. Without further structure, finding an optimal sequence is notoriously challenging, if not completely intractable. In this paper, we view the problem of adaptive and sequential decision making through the lens of submodularity and propose an adaptive greedy policy with strong theoretical guarantees. Additionally, to demonstrate the practical utility of our results, we run experiments on Amazon product recommendation and Wikipedia link prediction tasks.", "full_text": "Adaptive Sequence Submodularity\n\nMarko Mitrovic\nYale University\n\nEhsan Kazemi\nYale University\n\nMoran Feldman\nUniversity of Haifa\n\nmarko.mitrovic@yale.edu\n\nehsan.kazemi@yale.edu\n\nmoranfe@openu.ac.il\n\nAndreas Krause\n\nETH Z\u00a8urich\n\nkrausea@ethz.ch\n\nAmin Karbasi\nYale University\n\namin.karbasi@yale.edu\n\nAbstract\n\nIn many machine learning applications, one needs to interactively select a sequence\nof items (e.g., recommending movies based on a user\u2019s feedback) or make se-\nquential decisions in a certain order (e.g., guiding an agent through a series of\nstates). Not only do sequences already pose a dauntingly large search space, but we\nmust also take into account past observations, as well as the uncertainty of future\noutcomes. Without further structure, \ufb01nding an optimal sequence is notoriously\nchallenging, if not completely intractable. 
In this paper, we view the problem of\nadaptive and sequential decision making through the lens of submodularity and\npropose an adaptive greedy policy with strong theoretical guarantees. Additionally,\nto demonstrate the practical utility of our results, we run experiments on Amazon\nproduct recommendation and Wikipedia link prediction tasks.\n\n1 Introduction\n\nThe machine learning community has long recognized the importance of both sequential and adaptive\ndecision making. The study of sequences has led to novel neural architectures such as LSTMs [26],\nwhich have been used in a variety of applications ranging from machine translation [52] to image\ncaptioning [56]. Similarly, the study of adaptivity has led to the establishment of some of the most\npopular sub\ufb01elds of machine learning including active learning [48] and reinforcement learning [53].\nIn this paper, we consider the optimization of problems where both sequences and adaptivity are\nan integral part of the process. More speci\ufb01cally, we focus on problems that can be modeled as selecting\na sequence of items, where each of these items takes on some (initially unknown) state. The idea is\nthat the value of any sequence depends not only on the items selected and the order of these items but\nalso on the states of these items.\nConsider recommender systems as a running example. To start, the order in which we recommend\nitems can be just as important as the items themselves. For instance, if we believe that a user will\nenjoy the Lord of the Rings franchise, it is vital that we recommend the movies in the proper order. If\nwe suggest that the user watches the \ufb01nal installment \ufb01rst, she may end up completely unsatis\ufb01ed with\nan otherwise excellent recommendation. 
Furthermore, whether it is explicit feedback (such as rating\na movie on Net\ufb02ix) or implicit feedback (such as clicking/not clicking on an advertisement), most\nrecommender systems are constantly interacting with and adapting to each user. It is this feedback\nthat allows us to learn about the states of items we have already selected, as well as make inferences\nabout the states of items we have not selected yet.\nUnfortunately, the expressive modeling power of sequences and adaptivity comes at a cost. Not\nonly does optimizing over sequences instead of sets exponentially increase the size of the search\nspace, but adaptivity also necessitates a probabilistic approach that further complicates the problem.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWithout further assumptions, even approximate optimization is infeasible. As a result, we address\nthis challenge from the perspective of submodularity, an intuitive diminishing returns condition that\nappears in a broad scope of different areas, but still provides enough structure to make the problem\ntractable.\nResearch on submodularity, which itself has been a burgeoning \ufb01eld in recent years, has seen\ncomparatively little focus on sequences and adaptivity. This is especially surprising because many\nproblems that are commonly modeled under the framework of submodularity, such as recommender\nsystems [21, 59] and crowd teaching [49], stand to bene\ufb01t greatly from these concepts.\nWhile the lion\u2019s share of existing research in submodularity has focused on sets, a few recent lines\nof work extend the concept of submodularity to sequences. Tschiatschek et al. [54] were the \ufb01rst to\nconsider sequence submodularity in the general graph-based setting that we will follow in this paper.\nThey presented an algorithm with theoretical guarantees for directed acyclic graphs, while Mitrovic\net al. 
[45] developed a more comprehensive algorithm that provides theoretical guarantees for general\nhypergraphs.\nIn their experiments, both of these works showed that modeling the problem as sequence submodular\n(as opposed to set submodular) gave noticeable improvements. Their applications could bene\ufb01t\neven further from the aforementioned notions of adaptivity, but the existing theory behind sequence\nsubmodularity simply cannot model the problems in this way. While adaptive set submodularity has\nbeen studied extensively [12, 20, 22, 24], these approaches still fail to capture order dependencies.\nAlaei and Malekian [1] and Zhang et al. [60] also consider sequence submodularity (called string-\nsubmodularity in some works), but they use a different de\ufb01nition, which is based on subsequences\ninstead of graphs. On the other hand, Li and Milenkovic [39] have considered the interaction of\ngraphs and submodularity, but not in the context of sequences.\nOther Related Work Amongst many other applications, submodularity has also been used for\nvariable selection [36], data summarization [32, 40, 44], sensor placement [34], neural network\ninterpretability [14], network inference [23], and in\ufb02uence maximization in social networks [29].\nSubmodularity has also been studied extensively in a wide variety of settings, including distributed\nand scalable optimization [5\u20137, 17, 18, 38, 43, 44], streaming algorithms [3, 10, 11, 19, 28, 35, 46,\n47], robust optimization [9, 27, 37, 50, 55], weak submodularity [13, 15, 16, 31], and continuous\nsubmodularity [2, 4, 25, 51, 58].\nOur Contributions The main contributions of our paper are presented in the following sections:\n\u2022 In Section 2, we introduce our framework of adaptive sequence submodularity, which brings tractability to problems that include both sequences and adaptivity.\n\u2022 In Section 3, we present our algorithm for adaptive sequence submodular maximization. We present theoretical guarantees for our approach and we elaborate on the necessity of our novel 
proof techniques. We also show that these techniques simultaneously improve the state-of-the-art bounds for the problem of sequence submodularity by a factor of e/(e\u22121). Furthermore, we argue that any approximation guarantee must depend on the structure of the underlying graph unless the exponential time hypothesis is false.\n\u2022 In Section 4, we use datasets from Amazon and Wikipedia to compare our algorithm against existing sequence submodular baselines, as well as state-of-the-art deep learning-based approaches.\n\n2 Adaptive Sequence Submodularity\n\nAs discussed above, sequences and adaptivity are an integral part of many real-world problems. This\nmeans that many real-world problems can be modeled as selecting a sequence \u03c3 of items from a\nground set V , where each of these items takes on some (initially unknown) state o \u2208 O. A particular\nmapping of items to states is known as a realization \u03c6, and we assume there is some unknown\ndistribution p(\u03c6) that governs these states.\nFor example in movie recommendation, the set of all movies is our ground set V and our goal is to\nselect a sequence of movies that a particular user will enjoy. If we recommend a movie vi \u2208 V and\nthe user likes it, we place vi in state 1 (i.e. oi = 1). If not, we put it into state 0. Naturally, the value\nof a movie should be higher if the user liked it, and lower if she did not.\n\n\fFigure 1: (a) shows an underlying graph for a movie recommendation problem. The vertices are\nmovies and edges denote the additional value of watching certain movies in certain orders. (b) extends\nthis to the adaptive case, where both the vertices and the edges take on a state. The user has reported\nthat she liked the Fellowship of the Ring (so it is placed in state 1), but she did not like The Two\nTowers (so it is placed in state 0). 
The state of the last movie is still unknown. In this example, the\nstate of an edge is equal to the state of its starting vertex.\n\nFormally, we want to select a sequence \u03c3 that maximizes f (\u03c3, \u03c6), where f (\u03c3, \u03c6) is the value of\nsequence \u03c3 under realization \u03c6. However, \u03c6 is initially unknown to us and the state of each item in\nthe sequence is revealed to us only after we select it. In fact, even if we knew \u03c6 perfectly, the set of\nall sequences poses an intractably large search space. From an optimization perspective, this problem\nis hopeless without further structural assumptions.\nOur \ufb01rst step towards taming this problem is to follow the work of Tschiatschek et al. [54] and\nassume that the value of a sequence can be de\ufb01ned using a graph. Concretely, we have a directed\ngraph G = (V, E), where each item in our ground set is represented as a vertex v \u2208 V , and the\nedges encode the additional value intrinsic to picking certain items in certain orders. Mathematically,\nselecting a sequence of items \u03c3 will induce a set of edges E(\u03c3):\n\nE(\u03c3) = {(\u03c3i, \u03c3j) | (\u03c3i, \u03c3j) \u2208 E, i \u2264 j}.\n\nFor example, consider the graph in Figure 1a and consider the sequence \u03c3A = [F, T] where the user\nwatched The Fellowship of the Ring, and then The Two Towers, as well as the sequence \u03c3B = [T, F]\nwhere the user watched the same two movies but in the opposite order.\n\nE(\u03c3A) = E([F, T]) = {(F, F), (T, T), (F, T)}\nE(\u03c3B) = E([T, F]) = {(T, T), (F, F)}\n\nUsing the self-loops, this graph encodes the fact that there is certainly some intrinsic value to watching\nthese movies regardless of the order. 
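As a concrete illustration, the induced edge set E(\u03c3) can be computed directly from this definition. The following is a minimal Python sketch of ours (not code from the paper); the set-of-tuples graph encoding is an assumption made for the example:

```python
def induced_edges(sigma, edges):
    """Return E(sigma): all graph edges (sigma_i, sigma_j) with i <= j.

    sigma: a sequence (list) of vertices; edges: the set of directed
    edges of the underlying graph G, including self-loops.
    """
    return {(sigma[i], sigma[j])
            for i in range(len(sigma))
            for j in range(i, len(sigma))
            if (sigma[i], sigma[j]) in edges}

# Figure 1a: F = Fellowship, T = Two Towers, R = Return of the King.
E = {("F", "F"), ("T", "T"), ("R", "R"), ("F", "T"), ("T", "R"), ("F", "R")}
assert induced_edges(["F", "T"], E) == {("F", "F"), ("T", "T"), ("F", "T")}
assert induced_edges(["T", "F"], E) == {("T", "T"), ("F", "F")}
```

The two assertions reproduce E(\u03c3A) and E(\u03c3B) from the Lord of the Rings example: reversing the order of the two movies loses the edge (F, T).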
On the other hand, the edge (F, T) encodes the fact that watching\nThe Fellowship of the Ring before The Two Towers will bring additional value to the viewer, and this\nedge is only induced if the movies appear in the correct order in the sequence.\nWith this graph-based set-up, however, we run into issues when it comes to adaptivity. In particular,\nthe states of items naturally translate to states for the vertices, but it is not clear how to extend\nadaptivity to the edges. We tackle this challenge by assigning a state q \u2208 Q to each edge strictly as a\nfunction of the states of its endpoints. That is, similarly to how a sequence \u03c3 induces a set of edges\nE(\u03c3), a realization \u03c6 for the states of the vertices induces a realization \u03c6E for the states of the edges.\nWe want to emphasize that our framework works for any deterministic mapping from vertex states\nto edge states. One simple option that we will use throughout this paper as a running example is to\nde\ufb01ne the state of an edge to always be equal to the state of its start vertex.\nAs we will discuss later, the analysis for this approach will necessitate some novel proof techniques,\nbut the resulting framework is very \ufb02exible and it allows us to fully rede\ufb01ne the adaptive sequence\nproblem in terms of the underlying graph:\n\nf (\u03c3, \u03c6) = h(E(\u03c3), \u03c6E) where \u03c3 induces E(\u03c3) and \u03c6 induces \u03c6E.\n\nThe last necessary ingredient to bring tractability to this problem is submodularity. In particular,\nwe will assume that h(E(\u03c3), \u03c6E) is weakly adaptive set submodular. This is a relaxed version of\nstandard adaptive set submodularity that can model an even larger variety of problems, and it is a\nnatural \ufb01t for the applications we consider in this paper.\nIn order to formally de\ufb01ne weakly-adaptive submodularity, we need a bit more terminology. 
To start,\nwe de\ufb01ne a partial realization \u03c8 to be a mapping for only some subset of items (i.e., the states of\nthe remaining items are unknown). For notational convenience, we de\ufb01ne the domain of \u03c8, denoted\ndom(\u03c8), to be the list of items v for which the state of v is known. We say that \u03c8 is a subrealization\nof \u03c8', denoted \u03c8 \u2286 \u03c8', if dom(\u03c8) \u2286 dom(\u03c8') and they are equal everywhere in the domain of \u03c8.\nIntuitively, if \u03c8 \u2286 \u03c8', then \u03c8' has all the same information as \u03c8, and potentially more.\nGiven a partial realization \u03c8, we de\ufb01ne the marginal gain of a set A as\n\n\u2206(A | \u03c8) = E[h(dom(\u03c8) \u222a A, \u03c6) \u2212 h(dom(\u03c8), \u03c6) | \u03c8],\n\nwhere the expectation is taken over all the full realizations \u03c6 such that \u03c8 \u2286 \u03c6. In other words, we\ncondition on the states given by the partial realization \u03c8, and then we take the expectation across all\npossibilities for the remaining states.\nDe\ufb01nition 1. A function h : 2^E \u00d7 Q^E \u2192 R\u22650 is weakly adaptive set submodular with parameter \u03b3\nif for all sets A \u2286 E and for all \u03c8 \u2286 \u03c8' we have:\n\n\u2206(A | \u03c8') \u2264 (1/\u03b3) \u00b7 \u2211_{e\u2208A} \u2206(e | \u03c8).\n\nThis notion is a natural generalization of weak submodular functions [13] to adaptivity. The primary\ndifference is that we condition on subrealizations instead of just sets because we need to account\nfor the states of items. Note that in the context of this paper h is a function on the edges, so we\nwill condition on subrealizations of the edges \u03c8E. 
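To make the marginal gain \u2206(e | \u03c8) concrete, here is a minimal Python sketch (our own illustration, not the authors' code). It enumerates the full realizations consistent with a partial realization \u03c8, assuming independent uniform states for unknown vertices, the running rule that an edge inherits the state of its start vertex, and an h that counts selected edges in state 1:

```python
from itertools import product

def marginal_gain(edge, psi):
    """Delta(e | psi) when h counts selected edges in state 1.

    psi: dict mapping known vertices to states (0 or 1).  Unknown
    vertices are assumed (our assumption) to be in state 0 or 1 with
    probability 1/2 each.  An edge's state equals the state of its
    start vertex, so the expected gain of adding `edge` is simply the
    probability that its start vertex is in state 1 given psi.
    """
    start = edge[0]
    unknown = [] if start in psi else [start]
    total, count = 0.0, 0
    # Enumerate the full realizations phi consistent with psi.
    for states in product([0, 1], repeat=len(unknown)):
        phi = {**psi, **dict(zip(unknown, states))}
        total += phi[start]  # gain is 1 iff the edge is in state 1
        count += 1
    return total / count

# Figure 1b: F is liked (state 1), T is disliked (state 0), R is unknown.
psi1 = {"F": 1, "T": 0}
assert marginal_gain(("F", "R"), psi1) == 1.0  # start vertex known to be 1
assert marginal_gain(("R", "R"), psi1) == 0.5  # unknown: (1/2)*0 + (1/2)*1
assert marginal_gain(("F", "R"), {}) == 0.5    # under a smaller psi, F unknown
```

The gap between the last two assertions for the same edge (gain 1 under the larger partial realization, 1/2 under the smaller one) is exactly the comparison behind the bound \u03b3 \u2264 0.5 in the worked example that follows.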
However, these concepts apply more generally to\nfunctions on any set and state spaces, so we use \u03c8 in the formal de\ufb01nitions.\nDe\ufb01nition 2. A function h : 2^E \u00d7 Q^E \u2192 R\u22650 is adaptive monotone if \u2206(e | \u03c8) \u2265 0 for all partial\nrealizations \u03c8. That is, the conditional expected marginal bene\ufb01t of any element is non-negative.\n\nFigure 1b is designed to help clarify these concepts. It includes the same graph as Figure 1a, but now\nwe can receive feedback from the user. If we recommend a movie and the user likes it, we put the\ncorresponding vertex in state 1 (green in the image). Otherwise, we put the vertex in state 0 (red in\nthe image). Vertices whose states are still unknown are denoted by a dotted black line.\nNext, in our example, we need to de\ufb01ne a state for each edge in terms of the states of its endpoints.\nIn this case, we will de\ufb01ne the state of each edge to be equal to the state of its start point. In Figure\n1b, the user liked The Fellowship of the Ring, which puts edges (F, F), (F, T), and (F, R) in state 1\n(green). She did not like The Two Towers, so edges (T, T) and (T, R) are in state 0 (red), and we do\nnot know the state for The Return of the King, so the state of (R, R) is also unknown. We call this\npartial realization \u03c81 for the vertices, and the induced partial realization for the edges \u03c8E_1.\nSuppose our function h counts all induced edges that are in state 1. Furthermore, let us simply assume\nthat any unknown vertex is equally likely to be in state 0 or state 1. This means that the self-loop (R, R)\nis also equally likely to be in either state 0 or state 1. Therefore, \u2206((R, R) | \u03c8E_1) = (1/2) \u00d7 0 + (1/2) \u00d7 1 = 1/2.\nOn the other hand, consider the edge (F, R). Under \u03c81, we know F is in state 1, which means\n(F, R) is also in state 1, and thus, \u2206((F, R) | \u03c8E_1) = 1. However, if we consider a subrealization\n\u03c82 \u2286 \u03c81 where we do not know the state of F, then it is equally likely to be in either state and\n\u2206((F, R) | \u03c8E_2) = (1/2) \u00d7 0 + (1/2) \u00d7 1 = 1/2. Therefore, for this simple function we know that \u03b3 \u2264 0.5.
\n3 Adaptive Sequence-Greedy Policy and Theoretical Results\n\nIn this section, we introduce our Adaptive Sequence-Greedy policy and present its theoretical\nguarantees. We \ufb01rst formally de\ufb01ne weakly adaptive sequence submodularity.\nDe\ufb01nition 3. A function f (\u03c3, \u03c6) de\ufb01ned over a graph G(V, E) is weakly adaptive sequence submodular\nif f (\u03c3, \u03c6) = h(E(\u03c3), \u03c6E) where a sequence \u03c3 of vertices in V induces a set of edges E(\u03c3),\nrealization \u03c6 induces \u03c6E, and the function h is weakly adaptive set submodular. Note that if h is\nadaptive monotone, then f is also adaptive monotone.\nFormally, a policy \u03c0 is an algorithm that builds a sequence of k vertices by seeing which states have\nbeen observed at each step, then deciding which vertex should be chosen and observed next. If \u03c3\u03c0,\u03c6\nis the sequence returned by policy \u03c0 under realization \u03c6, then we write the expected value of \u03c0 as:\n\nfavg(\u03c0) = E[f (\u03c3\u03c0,\u03c6, \u03c6)] = E[h(E(\u03c3\u03c0,\u03c6), \u03c6E)]\n\nwhere again the expectation is taken over all possible realizations \u03c6. Our goal is to \ufb01nd a policy \u03c0\nthat maximizes favg(\u03c0), as de\ufb01ned above.\nOur Adaptive Sequence Greedy policy \u03c0 (Algorithm 1) starts with an empty sequence \u03c3. Throughout\nthe policy, we de\ufb01ne \u03c8\u03c3 to be the partial realization for the vertices in \u03c3. 
In turn this gives us the\npartial realization \u03c8E_\u03c3 for the induced edges.\nAt each step, we de\ufb01ne the valid set of edges E to be the edges whose endpoint is not already in \u03c3.\nThe main idea of our policy is that, at each step, we select the valid edge e \u2208 E with the highest\nexpected value \u2206(e | \u03c8E_\u03c3). For each such edge, the endpoints that are not already in the sequence \u03c3\nare concatenated (\u2295 means concatenate) to the end of \u03c3, and their states are observed (updating \u03c8\u03c3).\n\nAlgorithm 1 Adaptive Sequence Greedy Policy \u03c0\n1: Input: Directed graph G = (V, E), weakly adaptive sequence submodular f (\u03c3, \u03c6) = h(E(\u03c3), \u03c6E), and cardinality constraint k\n2: Let \u03c3 \u2190 ()\n3: while |\u03c3| \u2264 k \u2212 2 do\n4:   E = {eij \u2208 E | vj \u2209 \u03c3}\n5:   if E \u2260 \u2205 then\n6:     eij = arg max_{e\u2208E} \u2206(e | \u03c8E_\u03c3)\n7:     if vi = vj or vi \u2208 \u03c3 then\n8:       \u03c3 = \u03c3 \u2295 vj and observe state of vj\n9:     else\n10:      \u03c3 = \u03c3 \u2295 vi \u2295 vj and observe states of vi, vj\n11:    end if\n12:  else\n13:    break\n14:  end if\n15: end while\n16: Return \u03c3\n\nTheorem 1. For adaptive monotone and weakly adaptive sequence submodular function f, the\nAdaptive Sequence Greedy policy \u03c0 represented by Algorithm 1 achieves\n\nfavg(\u03c0) \u2265 \u03b3 / (2 d_in + \u03b3) \u00b7 favg(\u03c0\u2217),\n\nwhere \u03b3 is the weakly adaptive submodularity parameter, \u03c0\u2217 is the policy with the highest expected\nvalue and d_in is the largest in-degree of the input graph G.\nAs discussed by Mitrovic et al. [45], using a hypergraph H instead of a normal graph G allows us to\nencode more intricate relationships between the items. For example, in Figure 1a, the edges only\nencode pairwise relationships. However, there may be relationships between larger groups of items\nthat we want to encode explicitly. 
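Before turning to hypergraphs, the ordinary-graph policy of Algorithm 1 can be sketched in a few lines of Python. This is a hedged illustration of ours, not the authors' code: `gain` stands in for the oracle computing \u2206(e | \u03c8E_\u03c3), and `observe` stands in for the environment revealing a vertex state.

```python
def adaptive_sequence_greedy(vertices, edges, k, gain, observe):
    """Sketch of Algorithm 1 (Adaptive Sequence Greedy).

    edges:   set of directed edges (v_i, v_j); self-loops allowed.
    gain:    gain(edge, psi) -> expected marginal value Delta(e | psi),
             where psi maps already-observed vertices to their states.
    observe: observe(v) -> the revealed state of vertex v.
    """
    sigma, psi = [], {}
    while len(sigma) <= k - 2:
        # Valid edges: those whose endpoint is not already in sigma.
        valid = [(vi, vj) for (vi, vj) in edges if vj not in sigma]
        if not valid:
            break
        vi, vj = max(valid, key=lambda e: gain(e, psi))
        # Concatenate the endpoints that are new, observing their states.
        for v in ([vj] if (vi == vj or vi in sigma) else [vi, vj]):
            sigma.append(v)
            psi[v] = observe(v)
    return sigma
```

With the movie graph of Figure 1 and a gain oracle like the one sketched in Section 2, the policy first picks the highest-expected-value valid edge and then keeps extending \u03c3, updating \u03c8\u03c3 after each observation, until the budget is exhausted or no valid edge remains.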
For instance, if included, the value of a hyperedge (F, T, R) in\nFigure 1a would explicitly encode the value of watching The Fellowship of the Ring, followed by\nwatching The Two Towers, and then concluding with The Return of the King.\nWe can also extend our policy to general hypergraphs (see Algorithm 2 in Appendix B.3). Theorem 2\nguarantees the performance of our proposed policy for hypergraphs.\nTheorem 2. For adaptive monotone and weakly adaptive sequence submodular function f, the policy\n\u03c0' represented by Algorithm 2 achieves\n\nfavg(\u03c0') \u2265 \u03b3 / (r d_in + \u03b3) \u00b7 favg(\u03c0\u2217),\n\nwhere \u03b3 is the weakly adaptive submodularity parameter, \u03c0\u2217 is the policy with the highest expected\nvalue and r is the size of the largest hyperedge in the input hypergraph.\nIn our proofs, we have to handle the sequential nature of picking items and the revelation of states in a\ncombined setting. Unfortunately, the existing proof methods for sequence submodular maximization\nare not linear enough to allow for the use of the linearity of expectation that captures the stochasticity\nof the states. For this reason, we develop a novel analysis technique to guarantee the performance\nof our algorithms. Our proof replaces several lemmas from Mitrovic et al. [45] with tighter, more\nlinear analyses. Surprisingly, these new techniques also improve the theoretical guarantees of the\nnon-adaptive Sequence-Greedy and Hyper Sequence-Greedy [45] by a factor of e/(e\u22121).\nProofs for both theorems are given in Appendix B.\nGeneral Unifying Framework One more theoretical point we want to highlight is that weakly\nadaptive sequence submodularity provides a general unifying framework for a variety of common\nsubmodular settings including adaptive submodularity, weak submodularity, sequence submodularity,\nand classical set submodularity. 
If we have \u03b3 = 1 and the state of all vertices is deterministic, then\nwe have sequence submodularity. Conversely, if the vertex states are unknown, but our graph only\nhas self-loops, then we have weakly adaptive set submodularity (and correspondingly adaptive set\nsubmodularity if \u03b3 = 1). Lastly, if we have a graph with only self-loops, full knowledge of all states,\nand \u03b3 = 1, then we recover the original setting of classical set submodularity.\nTightness of Theoretical Results We acknowledge that the constant factor approximation we present\ndepends on the maximum in-degree. While ideally the theoretical bound would be completely\nindependent of the structure of the graph, we argue here that such a dependence is likely necessary.\nIndeed, getting a dependence better than O(n^{1/4}) in the approximation factor (where n is the total\nnumber of items) would improve the state-of-the-art algorithm for the very well-studied densest k-subgraph\nproblem (DkS) [8, 33]. Moreover, if we could get an approximation that is completely\nindependent of the structure of the graph, then the exponential time hypothesis would be proven false\u00b9.\nIn fact, even an almost polynomial approximation would break the exponential time hypothesis [41].\nNext, we formally state this hardness relationship. The proof is given in Appendix C.\nTheorem 3. Assuming the exponential time hypothesis is correct, there is no algorithm that approximates\nthe optimal solution for the (adaptive) sequence submodular maximization problem\nwithin a n^{1/(log log n)^c} factor, where n is the total number of items and c > 0 is a universal constant\nindependent of n.\n\n4 Experimental Results\n\n4.1 Amazon Product Recommendation\n\nUsing the Amazon Video Games review dataset [42], we consider the task of recommending products\nto users. In particular, given the \ufb01rst g products that the user has purchased, we want to predict the\nnext k products that she will buy. 
Full experimental details are given in Appendix D.1. Dataset and\ncode are attached in the supplementary material.\nWe start by using the training data to build a graph G = (V, E), where V is the set of all products\nand E is the set of edges between these products. The weight of each edge, wij, is de\ufb01ned to be the\nconditional probability of purchasing product j given that the user has previously purchased product i.\nThere are also self-loops with weight wii that represent the fraction of users that purchased product i.\nWe de\ufb01ne the state of each edge (i, j) to be equal to the state of product i. The intuitive idea is\nthat edge (i, j) encodes the value of purchasing product j after already having purchased product i.\nTherefore, if the user has de\ufb01nitely purchased i (i.e., product i is in state 1), then they should receive\nthe full value of wij. On the other hand, if she has de\ufb01nitely not purchased i (i.e., product i is in state\n0), then edge (i, j) provides no value. Lastly, if the state of i is unknown, then the expected gain\nof edge (i, j) is discounted by wii, the value of the self-loop on i, which can be viewed as a simple\nestimate for the probability of the user purchasing product i. See Figure 2a for a small example.\nWe use a probabilistic coverage utility function as our monotone weakly-adaptive set submodular\nfunction h. Mathematically,\n\nh(E1) = \u2211_{j\u2208V} [1 \u2212 \u220f_{(i,j)\u2208E1} (1 \u2212 wij)],\n\nwhere E1 \u2286 E is the subset of edges that are in state 1. Note that with this set-up, the value of \u03b3 can\nbe dif\ufb01cult to calculate exactly. 
However, roughly speaking, it is inversely proportional to the value\nof the smallest weight self-loop wii.\n\n\u00b9 If the exponential time hypothesis is true it would imply that P \u2260 NP, but it is a stronger statement.\n\n\fFigure 2: (a) shows a small subset of the underlying graph with states for a particular user. (b) and (c)\nshow our results on the Amazon product recommendation task. In all these graphs, the number of\ngiven products g is 4. (d) gives an example illustrating the difference between the two performance\nmeasures. (e) and (f) show our results on the same task, but using only 1% of the available training data to\nshow that our algorithm outperforms deep learning-based approaches in data-scarce environments.\n\nWe compare the performance of our Adaptive Sequence-Greedy policy against Sequence-Greedy\nfrom Mitrovic et al. [45], the existing sequence submodularity baseline that does not consider states.\nTo give further context for our results, we compare against Frequency, a naive baseline that ignores\nsequences and adaptivity and simply outputs the k most popular products.\nWe also compare against a set of deep learning-based approaches (see Appendix D.3 for full details).\nIn particular, we implement adaptive and non-adaptive versions of both a regular Feed Forward\nNeural Network and an LSTM. The adaptive version will update its inputs after every prediction to\nre\ufb02ect whether or not the user liked the recommendation. Conversely, the non-adaptive version will\nsimply make k predictions using just the original input.\nWe use two different measures to compare the various algorithms. 
The \ufb01rst is the Accuracy Score,\nwhich simply counts the number of recommended products that the user indeed ended up purchasing.\nWhile this is a sensible measure, it does not explicitly consider the order of the sequence. Therefore,\nwe also consider the Sequence Score, which is a measure based on the Kendall-Tau distance [30]. In\nshort, this measure counts the number of ordered pairs that appear in both the predicted sequence and\nthe true sequence. Figure 2d gives an example comparing the two measures.\nFigures 2b and 2c show the performance of the various algorithms using the accuracy score and\nsequence score, respectively. These results highlight the importance of adaptivity as the adaptive al-\ngorithms consistently outperform their non-adaptive counterparts under both scoring regimes. Notice\nthat in both cases, as the number of recommendations increases, our proposed Adaptive Sequence-\nGreedy policy is outperformed only by the Adaptive Feed Forward Neural Network. Although\nLSTMs are generally considered better for sequence data than vanilla feed-forward networks, we\nthink it is a lack of data that causes them to perform poorly in our experiments.\nAnother observation, which \ufb01ts the conventional wisdom, is that deep learning-based approaches\ncan perform well when there is a lot of data. However, when the data is scarce, we see that the\nSequence-Greedy based approaches outperform the deep learning-based approaches. 
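The Sequence Score described above can be sketched in Python (our own illustration, not the paper's evaluation code). Following the example of Figure 2d, a sequence's ordered pairs are taken to be all pairs (s_i, s_j) with i \u2264 j, so each element also pairs with itself:

```python
def ordered_pairs(seq):
    """All ordered pairs (seq[i], seq[j]) with i <= j, including self-pairs."""
    return {(seq[i], seq[j])
            for i in range(len(seq))
            for j in range(i, len(seq))}

def sequence_score(predicted, true):
    """Number of ordered pairs shared by the predicted and true sequences."""
    return len(ordered_pairs(predicted) & ordered_pairs(true))

# The example of Figure 2d: true sequence [A, B, C, D], predicted [J, A, M, D].
# Shared pairs are (A, A), (D, D), and (A, D), so the score is 3 even though
# the plain Accuracy Score (elements in common) is only 2.
assert sequence_score(["J", "A", "M", "D"], ["A", "B", "C", "D"]) == 3
```

Unlike the Accuracy Score, this measure rewards getting elements in the right relative order, which is the property the Kendall-Tau distance is built around.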
Figures 2e\nand 2f simulate a data-scarce environment by using only 1% of the available data as training data.\nNote that the difference between the adaptive algorithms and their non-adaptive counterparts is\nless obvious in this setting because the adaptive algorithms use correct guesses to improve future\nrecommendations, but the data scarcity makes it dif\ufb01cult to make a correct guess in the \ufb01rst place.\nAside from competitive accuracy and sequence scores, the Adaptive Sequence-Greedy algorithm\nprovides several advantages over the neural network-based approaches. From a theoretical perspective,\nthe Adaptive Sequence-Greedy algorithm has provable guarantees on its performance, while little\nis known about the theoretical performance of neural networks. 
Furthermore, the decisions made\nby the Adaptive Sequence-Greedy algorithm are easily interpretable and understandable (it is just\npicking the edge with the highest expected value), while neural networks are generally a black-box.\nOn a similar note, Adaptive Sequence-Greedy may be preferable from an implementation perspective\nbecause it does not require any hyperparameter tuning. It is also more robust to changing inputs in\nthe sense that we can easily add another product and its associated edges to our graph, but adding\nanother product to the neural network requires changing the entire input and output structure, and\nthus, generally necessitates retraining the entire network.\n\n4.2 Wikipedia Link Prediction\n\nUsing the Wikispeedia dataset [57], we consider users who are sur\ufb01ng through Wikipedia towards\nsome target article. Given a sequence of articles the user has previously visited, we want to guide her\nto the page she is trying to reach. Since different pages have different valid links, the order of pages\nwe visit is critical to this task. Formally, given the \ufb01rst g = 3 pages each user visited, we want to\npredict which page she is trying to reach by making a series of suggestions for which link to follow.\nIn this case, we have G = (V, E), where V is the set of all pages and E is the set of existing links\nbetween pages. Similarly to before, the weight wij of an edge (i, j) \u2208 E is the probability of moving\nto page j given that the user is currently at page i. In this case, there are no self-loops as we assume\nwe can only move using links, and thus we cannot jump to random pages. We again de\ufb01ne two states\nfor the nodes: 1 if the user de\ufb01nitely visits this page and 0 if the user does not want to visit this page.\nThis application highlights the importance of adaptivity because the non-adaptive sequence submodu-\nlarity framework cannot model this problem properly. 
This is because the Sequence-Greedy algorithm is free to choose any edge in the underlying graph, so there is no way to force it to pick a link that is connected to the user's current page. On the other hand, with Adaptive Sequence-Greedy, we can use the states to penalize invalid edges, and thus force the algorithm to select only links connected to the user's current page. Similarly, we only have the adaptive versions of the deep learning baselines because we need information about our current page in order to construct a valid path (Appendix D.3 gives a more detailed explanation).

Figure 3a shows an example of predicted paths, while Figure 3b shows our quantitative results. More detail about the relevance distance metric is given in Appendix D.2, but the idea is that it measures the relevance of the final output page to the true target page (a lower score indicates a higher relevance). The main observation here is that the Adaptive Sequence-Greedy algorithm actually outperforms the deep learning-based approaches. The main reason for this discrepancy is likely a lack of data, as we have 619 pages to choose from and only 7,399 completed search paths.

5 Conclusion

In this paper, we introduced adaptive sequence submodularity, a general framework for bringing tractability to the broad class of optimization problems that consider both sequences and adaptivity. We presented Adaptive Sequence-Greedy, a general policy for optimizing weakly adaptive sequence submodular functions. We provided a provable theoretical guarantee for our algorithm, as well as a discussion of the tightness of our result. Our novel analysis also improves the theoretical guarantees of Sequence-Greedy and Hyper Sequence-Greedy [45] by a factor of e/(e−1). Finally, we evaluated the performance of Adaptive Sequence-Greedy on an Amazon product recommendation task and a Wikipedia link prediction task.
Not only does our Adaptive Sequence-Greedy policy exhibit competitive performance with the state of the art, but it also provides several notable advantages, including interpretability, ease of implementation, and robustness against both data scarcity and input adjustments.

Acknowledgements. This work was partially supported by NSF (IIS-1845032), ONR (N00014-19-1-2406), AFOSR (FA9550-18-1-0160), ISF (1357/16), and ERC StG SCADAPT.

Figure 3: (a) The left side shows the real path a user followed from Batman to Computer. Given the first three pages, the right side shows the path predicted by Adaptive Sequence-Greedy versus a deep learning-based approach. Green shows correct guesses that were followed, while red shows incorrect guesses that were not pursued further. (b) shows the overall performance of the various approaches.

References

[1] Saeed Alaei and Azarakhsh Malekian. Maximizing sequence-submodular functions and its application to online advertising. arXiv preprint arXiv:1009.4153, 2010.

[2] Francis Bach. Submodular functions: from discrete to continuous domains. arXiv preprint arXiv:1511.00394, 2015.

[3] Ashwin Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In Knowledge Discovery and Data Mining (KDD), 2014.

[4] Wenruo Bai, William Stafford Noble, and Jeff A. Bilmes. Submodular Maximization via Gradient Ascent: The Case of Deep Submodular Functions. In Advances in Neural Information Processing Systems, pages 7989–7999, 2018.

[5] Eric Balkanski and Yaron Singer. The adaptive complexity of maximizing a submodular function. In Symposium on Theory of Computing (STOC), pages 1138–1151, 2018.

[6] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An Exponential Speedup in Parallel Running Time for Submodular Maximization without Loss in Approximation.
In Symposium on Discrete Algorithms (SODA), pages 283–302, 2019.

[7] Rafael Barbosa, Alina Ene, Huy Nguyen, and Justin Ward. The power of randomization: Distributed submodular maximization on massive datasets. In International Conference on Machine Learning (ICML), 2015.

[8] Aditya Bhaskara, Moses Charikar, Eden Chlamtac, Uriel Feige, and Aravindan Vijayaraghavan. Detecting high log-densities: an O(n^{1/4}) approximation for densest k-subgraph. In Symposium on Theory of Computing (STOC), pages 201–210, 2010.

[9] Ilija Bogunovic, Slobodan Mitrovic, Jonathan Scarlett, and Volkan Cevher. Robust Submodular Maximization: A Non-Uniform Partitioning Approach. In International Conference on Machine Learning (ICML), 2017.

[10] Niv Buchbinder, Moran Feldman, and Roy Schwartz. Online submodular maximization with preemption. In Symposium on Discrete Algorithms (SODA), 2015.

[11] Amit Chakrabarti and Sagar Kale. Submodular maximization meets streaming: Matchings, matroids, and more. In Integer Programming and Combinatorial Optimization (IPCO), 2014.

[12] Yuxin Chen and Andreas Krause. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization. In International Conference on Machine Learning (ICML), 2013.

[13] Abhimanyu Das and David Kempe. Submodular meets Spectral: Greedy Algorithms for Subset Selection, Sparse Approximation and Dictionary Selection. In International Conference on Machine Learning (ICML), pages 1057–1064, 2011.

[14] Ethan Elenberg, Alexandros Dimakis, Moran Feldman, and Amin Karbasi. Streaming Weak Submodularity: Interpreting Neural Networks on the Fly.
In Advances in Neural Information Processing Systems, 2018.

[15] Ethan R. Elenberg, Rajiv Khanna, Alexandros G. Dimakis, and Sahand N. Negahban. Restricted strong convexity implies weak submodularity. CoRR, abs/1612.00804, 2016.

[16] Ethan R. Elenberg, Alexandros G. Dimakis, Moran Feldman, and Amin Karbasi. Streaming Weak Submodularity: Interpreting Neural Networks on the Fly. In Advances in Neural Information Processing Systems, pages 4047–4057, 2017.

[17] Alina Ene and Huy L. Nguyen. Submodular Maximization with Nearly-optimal Approximation and Adaptivity in Nearly-linear Time. In Symposium on Discrete Algorithms (SODA), pages 274–282, 2019.

[18] Matthew Fahrbach, Vahab S. Mirrokni, and Morteza Zadimoghaddam. Submodular Maximization with Nearly Optimal Approximation, Adaptivity and Query Complexity. In Symposium on Discrete Algorithms (SODA), pages 255–273, 2019.

[19] Moran Feldman, Amin Karbasi, and Ehsan Kazemi. Do Less, Get More: Streaming Submodular Maximization with Subsampling. In Advances in Neural Information Processing Systems, pages 730–740, 2018.

[20] Kaito Fujii and Shinsaku Sakaue. Beyond Adaptive Submodularity: Approximation Guarantees of Greedy Policy with Adaptive Submodularity Ratio. In International Conference on Machine Learning (ICML), 2019.

[21] Victor Gabillon, Branislav Kveton, Zheng Wen, Brian Eriksson, and S. Muthukrishnan. Adaptive submodular maximization in bandit settings. In Advances in Neural Information Processing Systems, 2013.

[22] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

[23] Manuel Gomez Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. In Knowledge Discovery and Data Mining (KDD), 2010.

[24] Alkis Gotovos, Amin Karbasi, and Andreas Krause.
Non-monotone adaptive submodular maximization. In International Joint Conferences on Artificial Intelligence (IJCAI), 2015.

[25] S. Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient Methods for Submodular Maximization. In Advances in Neural Information Processing Systems, pages 5843–5853, 2017.

[26] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.

[27] Ehsan Kazemi, Morteza Zadimoghaddam, and Amin Karbasi. Scalable Deletion-Robust Submodular Maximization: Data Summarization with Privacy and Fairness Constraints. In International Conference on Machine Learning (ICML), 2018.

[28] Ehsan Kazemi, Marko Mitrovic, Morteza Zadimoghaddam, Silvio Lattanzi, and Amin Karbasi. Submodular Streaming in All Its Glory: Tight Approximation, Minimum Memory and Low Adaptive Complexity. In International Conference on Machine Learning (ICML), pages 3311–3320, 2019.

[29] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Knowledge Discovery and Data Mining (KDD), 2003.

[30] Maurice Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.

[31] Rajiv Khanna, Ethan R. Elenberg, Alexandros G. Dimakis, Sahand N. Negahban, and Joydeep Ghosh. Scalable Greedy Feature Selection via Weak Submodularity. In Artificial Intelligence and Statistics (AISTATS), pages 1560–1568, 2017.

[32] Katrin Kirchhoff and Jeff Bilmes. Submodularity for data selection in statistical machine translation. In Empirical Methods in Natural Language Processing (EMNLP), 2014.

[33] Guy Kortsarz and David Peleg. On Choosing a Dense Subgraph (Extended Abstract). In Foundations of Computer Science (FOCS), pages 692–701, 1993.

[34] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies.
Journal of Machine Learning Research, 9, 2008.

[35] Andreas Krause and Ryan G. Gomes. Budgeted nonparametric learning from data streams. In International Conference on Machine Learning (ICML), 2010.

[36] Andreas Krause and Carlos Guestrin. Near-optimal nonmyopic value of information in graphical models. In Uncertainty in Artificial Intelligence (UAI), 2005.

[37] Andreas Krause, H. Brendan McMahan, Carlos Guestrin, and Anupam Gupta. Robust submodular observation selection. Journal of Machine Learning Research, 2008.

[38] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in mapreduce and streaming. In Symposium on Parallelism in Algorithms and Architectures (SPAA), 2013.

[39] Pan Li and Olgica Milenkovic. Inhomogeneous hypergraph clustering with applications. In Advances in Neural Information Processing Systems, 2017.

[40] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Association for Computational Linguistics (ACL), 2011.

[41] Pasin Manurangsi. Almost-polynomial Ratio ETH-hardness of Approximating Densest k-Subgraph. In Symposium on Theory of Computing (STOC), pages 954–961, 2017.

[42] J. McAuley, C. Targett, J. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In SIGIR Conference on Research and Development in Information Retrieval, 2015.

[43] Vahab Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In Symposium on Theory of Computing (STOC), 2015.

[44] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, 2013.

[45] Marko Mitrovic, Moran Feldman, Andreas Krause, and Amin Karbasi. Submodularity on hypergraphs: From sets to sequences.
In Artificial Intelligence and Statistics (AISTATS), 2018.

[46] Marko Mitrovic, Ehsan Kazemi, Morteza Zadimoghaddam, and Amin Karbasi. Data summarization at scale: A two-stage submodular approach. In International Conference on Machine Learning (ICML), 2018.

[47] Ashkan Norouzi-Fard, Jakub Tarnawski, Slobodan Mitrovic, Amir Zandieh, Aidasadat Mousavifar, and Ola Svensson. Beyond 1/2-Approximation for Submodular Maximization on Massive Data Streams. In International Conference on Machine Learning (ICML), pages 3826–3835, 2018.

[48] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2012.

[49] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In International Conference on Machine Learning (ICML), 2014.

[50] Matthew Staib and Stefanie Jegelka. Robust Budget Allocation via Continuous Submodular Functions. In International Conference on Machine Learning (ICML), pages 3230–3240, 2017.

[51] Matthew Staib, Bryan Wilder, and Stefanie Jegelka. Distributionally Robust Submodular Maximization. CoRR, abs/1802.05249, 2018.

[52] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.

[53] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd edition. MIT Press, 2018.

[54] Sebastian Tschiatschek, Adish Singla, and Andreas Krause. Selecting sequences of items via submodular maximization. In AAAI Conference on Artificial Intelligence, 2017.

[55] Vasileios Tzoumas, Konstantinos Gatsis, Ali Jadbabaie, and George J. Pappas. Resilient monotone submodular function maximization. In Conference on Decision and Control (CDC), 2017.

[56] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan.
Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015.

[57] Robert West, Joelle Pineau, and Doina Precup. An Online Game for Inferring Semantic Distances between Concepts. In International Joint Conferences on Artificial Intelligence (IJCAI), 2009.

[58] Laurence A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 1982.

[59] Yisong Yue and Carlos Guestrin. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems, 2011.

[60] Zhenliang Zhang, Edwin K. P. Chong, Ali Pezeshki, and William Moran. String Submodular Functions with Curvature Constraints. IEEE Transactions on Automatic Control, 61(3):601–616, 2016.