{"title": "Budgeted stream-based active learning via adaptive submodular maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 522, "abstract": "Active learning enables us to reduce the annotation cost by adaptively selecting unlabeled instances to be labeled. For pool-based active learning, several effective methods with theoretical guarantees have been developed through maximizing some utility function satisfying adaptive submodularity. In contrast, there have been few methods for stream-based active learning based on adaptive submodularity. In this paper, we propose a new class of utility functions, policy-adaptive submodular functions, and prove this class includes many existing adaptive submodular functions appearing in real world problems. We provide a general framework based on policy-adaptive submodularity that makes it possible to convert existing pool-based methods to stream-based methods and give theoretical guarantees on their performance. In addition we empirically demonstrate their effectiveness comparing with existing heuristics on common benchmark datasets.", "full_text": "Budgeted stream-based active learning\nvia adaptive submodular maximization\n\nKaito Fujii\n\nKyoto University\n\nJST, ERATO, Kawarabayashi Large Graph Project\n\nfujii@ml.ist.i.kyoto-u.ac.jp\n\nHisashi Kashima\nKyoto University\n\nkashima@i.kyoto-u.ac.jp\n\nAbstract\n\nActive learning enables us to reduce the annotation cost by adaptively selecting\nunlabeled instances to be labeled. For pool-based active learning, several effec-\ntive methods with theoretical guarantees have been developed through maximiz-\ning some utility function satisfying adaptive submodularity. In contrast, there have\nbeen few methods for stream-based active learning based on adaptive submodu-\nlarity. 
In this paper, we propose a new class of utility functions, policy-adaptive submodular functions, which includes many existing adaptive submodular functions appearing in real-world problems. We provide a general framework based on policy-adaptive submodularity that makes it possible to convert existing pool-based methods to stream-based methods and give theoretical guarantees on their performance. In addition, we empirically demonstrate their effectiveness by comparing with existing heuristics on common benchmark datasets.\n\n1 Introduction\n\nActive learning is a problem setting for sequentially selecting unlabeled instances to be labeled, and it has been studied with much practical interest as an efficient way to reduce the annotation cost. One of the most popular settings of active learning is the pool-based one, in which the learner is given the entire set of unlabeled instances in advance and iteratively selects an instance to be labeled next. The stream-based setting, which we deal with in this paper, is another important setting of active learning, in which the entire set of unlabeled instances is hidden initially and presented one by one to the learner. This setting also has many real-world applications, for example, sentiment analysis of web stream data [26], spam filtering [25], part-of-speech tagging [10], and video surveillance [23].\nAdaptive submodularity [19] is an adaptive extension of submodularity, a natural diminishing-return condition. It provides a framework for designing effective algorithms for several adaptive problems including pool-based active learning. For instance, methods for noiseless active learning [19, 21] and for noisy active learning [20, 9, 8] have been developed in recent years. 
Not only do they have strong theoretical guarantees on their performance, but they also perform well in practice compared with existing widely-used heuristics.\nIn spite of its considerable success in the pool-based setting, little is known about the benefits of adaptive submodularity in the stream-based setting. This paper answers the question: is it possible to construct algorithms for stream-based active learning based on adaptive submodularity? We propose a general framework for creating stream-based algorithms from existing pool-based algorithms.\nIn this paper, we tackle the problem of stream-based active learning with a limited budget for making queries. The goal is to collect an informative set of labeled instances from a data stream of a certain length. The stream-based active learning problem has typically been studied in two settings:\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nthe stream setting and the secretary setting, which correspond to memory constraints and timing constraints respectively; we treat both in this paper.\nWe formalize these problems as the adaptive stochastic maximization problem in the stream or secretary setting. To solve this problem, we propose a new class of stochastic utility functions: policy-adaptive submodular functions, which is another adaptive extension of submodularity. We prove that this class includes many existing adaptive submodular functions used in various applications. Assuming the objective function satisfies policy-adaptive submodularity, we propose simple methods for each problem and give theoretical guarantees on their performance in comparison to the optimal pool-based method. Experiments conducted on benchmark datasets show the effectiveness of our methods compared with several heuristics. 
Thanks to our framework, many algorithms developed in the pool-based setting can be converted to the stream-based setting.\nIn summary, our main contributions are the following:\n• We provide a general framework that captures budgeted stream-based active learning and other applications.\n• We propose a new class of stochastic utility functions, policy-adaptive submodular functions, which is a subclass of the adaptive submodular functions, and prove that this class includes many existing adaptive submodular functions in real-world problems.\n• We propose two simple algorithms, AdaptiveStream and AdaptiveSecretary, and give theoretical performance guarantees on them.\n\n2 Problem Settings\n\nIn this section, we first describe the general framework, then illustrate applications including stream-based active learning.\n\n2.1 Adaptive Stochastic Maximization in the Stream and Secretary Settings\n\nHere we specify the problem statement. This problem is a generalization of budgeted stream-based active learning and other applications.\nLet V = {v1, …, vn} denote the entire set of n items; each item vi is in a particular state out of the set Y of possible states. Denote by φ : V → Y a realization of the states of the items. Let Φ be a random realization, and Yi a random variable representing the state of each item vi for i = 1, …, n, i.e., Yi = Φ(vi). Assume that φ is generated from a known prior distribution p(φ). Suppose the state Yi is revealed when vi is selected. Let ψ_A : A → Y denote the partial realization obtained after the states of the items A ⊆ V are observed. Note that a partial realization ψ_A can be regarded as the set of observations {(s, ψ_A(s)) | s ∈ A} ⊆ V × Y.\nWe are given a set function1 f : 2^(V×Y) → R≥0 that defines the utility of the observations made when some items are selected. 
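As a concrete reading of this notation, a realization can be represented as a mapping from items to states, and a partial realization as a set of (item, state) pairs. The following minimal sketch illustrates conditioning the prior on a partial realization; the items, states, uniform prior, and toy utility are all hypothetical, and the prior is stored as an explicit list of (probability, realization) pairs, which is only feasible for tiny examples:

```python
from itertools import product

# Hypothetical toy instantiation: three items, binary states, uniform prior.
# A realization phi maps every item to its state; a partial realization psi
# is the set of (item, state) observations made so far.
V = ["v1", "v2", "v3"]
states = [+1, -1]
prior = [(1.0 / len(states) ** len(V), dict(zip(V, assignment)))
         for assignment in product(states, repeat=len(V))]

def consistent(phi, psi):
    """True if phi agrees with every observation (s, y) in psi."""
    return all(phi[s] == y for s, y in psi)

def posterior(prior, psi):
    """Condition the prior p(phi) on the observations psi, giving p(phi | psi)."""
    support = [(p, phi) for p, phi in prior if consistent(phi, psi)]
    z = sum(p for p, _ in support)
    return [(p / z, phi) for p, phi in support]

# Any utility f : 2^(V x Y) -> R>=0 can be plugged in; a toy one for illustration.
def f(psi):
    return len(psi)

psi = frozenset({("v1", +1)})   # we have observed that v1 is in state +1
post = posterior(prior, psi)    # 4 of the 8 realizations remain possible
```

Representing a partial realization as a set of pairs, as in the text above, makes the subset relation between partial realizations an ordinary set inclusion.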
Consider iteratively selecting an item to observe its state, aiming to make observations of high utility value. A policy π is a decision tree that represents a strategy for adaptively selecting items. Formally, it is defined to be a partial mapping that determines the item to be selected next from the observations made so far. Given some budget k ∈ Z>0, the goal is to construct a policy π maximizing E_Φ[f(ψ(π, Φ))] subject to |ψ(π, φ)| ≤ k for all φ, where ψ(π, φ) denotes the observations obtained by executing policy π under realization φ.\nThis problem has been studied mainly in the pool-based setting, where we are given the entire set V from the beginning and adaptively observe the states of items in any order. In this paper we tackle the stream-based setting, where the items are hidden initially and arrive one by one. The stream-based setting arises in two kinds of scenarios. One is the stream setting2, in which we can postpone deciding whether or not to select an item by keeping it in a limited amount of memory, and can at any time observe the states of the stored items. The other is the secretary setting, in which we must decide\n\n1In the original definition of stochastic utility functions [19], the objective value depends not only on the partial realization ψ, but also on the realization φ. However, given such f : 2^V × Y^V → R≥0, we can redefine f~ : 2^(V×Y) → R≥0 as f~(ψ_A) = E_Φ[f(A, Φ) | Φ ∼ p(Φ | ψ_A)], and this does not critically change the overall discussion in our problem settings. 
Thus, for notational convenience, we use the simpler definition.\n2In this paper, "stream-based setting" and "stream setting" are distinguished.\n\nFigure 1: Examples of a pool-based policy and a stream-based policy in the case of Y = {+1, −1}. (a) A policy tree for the pool-based setting: a pool-based policy can select items in an arbitrary order. (b) A policy tree for the stream-based setting: a stream-based policy must select items under memory or timing constraints, taking account of only the items that have arrived so far.\n\nimmediately whether or not to select an item at each arrival. In both settings we assume the items arrive in a random order. A comparison of policies for the pool-based and stream-based settings is given in Figure 1.\n\n2.2 Budgeted Stream-based Active Learning\n\nWe consider a problem setting called Bayesian active learning. Here V represents the set of instances, Y1, …, Yn the initially unknown labels of the instances, and Y the set of possible labels. Let H denote the set of candidates for the randomly generated true hypothesis H, and pH denote a prior probability over H. When observations of the labels are noiseless, every hypothesis h ∈ H represents a particular realization, i.e., h corresponds to some φ ∈ Y^V. When observations are noisy, the probability distribution P[Y1, …, Yn | H = h] of the labels is not necessarily deterministic for each h ∈ H. In both cases, we can iteratively select an instance and query its label to the annotation oracle. The objective is to determine the true hypothesis or one whose prediction error is small. Both the pool-based and stream-based settings have been extensively studied. 
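In the noiseless case, one standard utility of this kind is the generalized binary search objective [12, 19], which rewards the prior probability mass of hypotheses ruled out by the labels observed so far. A minimal sketch, with a hypothetical four-hypothesis class and a made-up prior:

```python
# Hypothetical noiseless instantiation: four hypotheses over two instances,
# each hypothesis fixing every label, with a made-up prior p_H over them.
hypotheses = {            # h : instance -> label
    "h1": {"x1": +1, "x2": +1},
    "h2": {"x1": +1, "x2": -1},
    "h3": {"x1": -1, "x2": +1},
    "h4": {"x1": -1, "x2": -1},
}
p_H = {"h1": 0.4, "h2": 0.3, "h3": 0.2, "h4": 0.1}

def gbs_utility(psi):
    """Prior mass of hypotheses ruled out by the labeled set psi (a set of
    (instance, label) pairs); rewarding this mass shrinks the version space."""
    surviving = sum(p_H[h] for h, labels in hypotheses.items()
                    if all(labels[x] == y for x, y in psi))
    return 1.0 - surviving

# Observing x1 = +1 rules out h3 and h4, eliminating prior mass 0.2 + 0.1.
gain = gbs_utility(frozenset({("x1", +1)}))
```

As Section 3 notes, this objective is not only adaptive submodular but also policy-adaptive submodular.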
The stream-based setting contains the stream and secretary settings, both of which have many real-world applications.\nA common approach for devising a pool-based algorithm is to design some utility function that represents the informativeness of a set of labeled instances and to greedily select the instance maximizing this utility in terms of the expected value. We introduce such a utility into stream-based active learning, and aim to collect k labeled instances of high utility, where k ∈ Z>0 is the budget on the number of queries. While most of the theoretical results for stream-based active learning are obtained assuming the data stream is infinite, we assume the length of the total data stream is given in advance.\n\n2.3 Other Applications\n\nWe give a brief sketch of two examples that can be formalized as the adaptive stochastic maximization problem in the secretary setting. Both are variations for streaming data of problems first proposed by Golovin and Krause [19].\nOne is adaptive viral marketing, whose aim is to spread information about a new product through social networks. In this problem we adaptively select k people to whom a free promotional sample of the product is offered, so as to let them recommend the product to their friends. We cannot know whether a person will recommend the product before actually offering them a sample. The objective is to maximize the number of people that information about the product reaches. There arise situations where people come sequentially, and at each arrival we must decide whether or not to offer a sample to them.\nAnother is adaptive sensor placement. We want to adaptively place k unreliable sensors so as to cover the information obtained by them. The informativeness of each sensor is unknown before its deployment. 
We can consider cases where the timing of placing a sensor at each location is restricted for some reason such as transportation cost.\n\n3 Policy-Adaptive Submodularity\n\nIn this section, we discuss conditions satisfied by the utility functions of adaptive stochastic maximization problems.\nSubmodularity [17] is known as a natural diminishing-return condition satisfied by various set functions appearing in many applications, and adaptive submodularity was proposed by Golovin and Krause [19] as an adaptive extension of submodularity. Adaptive submodularity is defined as the diminishing-return property of the expected marginal gain of a single item, i.e., Δ(s | ψ_A) ≥ Δ(s | ψ_B) for any partial realizations ψ_A ⊆ ψ_B and item s ∈ V \\ B, where\n\nΔ(s | ψ) = E_Φ[f(ψ ∪ {(s, Φ(s))}) − f(ψ) | Φ ∼ p(Φ | ψ)].\n\nSimilarly, adaptive monotonicity, an adaptive analog of monotonicity, is defined as Δ(s | ψ_A) ≥ 0 for any partial realization ψ_A and item s ∈ V. It is known that many utility functions used in the above applications satisfy adaptive submodularity and adaptive monotonicity. In the pool-based setting, greedily selecting the item of maximal expected marginal gain yields a (1 − 1/e)-approximation if the objective function is adaptive submodular and adaptive monotone [19].\nHere we propose a new class of stochastic utility functions, policy-adaptive submodular functions. Let range(π) denote the set containing all items that π selects for some φ, and define policy-adaptive submodularity as the diminishing-return property of the expected marginal gain of any policy, as follows.\nDefinition 3.1 (Policy-adaptive submodularity). A set function f : 2^(V×Y) → R≥0 is policy-adaptive submodular with respect to a prior distribution p(φ), or (f, p) is policy-adaptive submodular, if Δ(π | ψ_A) ≥ Δ(π | ψ_B) holds for any partial realizations ψ_A, ψ_B and policy π such that ψ_A ⊆ ψ_B and range(π) ⊆ V \\ B, where\n\nΔ(π | ψ) = E_Φ[f(ψ ∪ ψ(π, Φ)) − f(ψ) | Φ ∼ p(Φ | ψ)].\n\nSince a single item can be regarded as a policy selecting only one item, policy-adaptive submodularity is a stricter condition than adaptive submodularity.\nPolicy-adaptive submodularity is also a natural extension of submodularity. The submodularity of a set function f : 2^V → R≥0 is defined as the condition that f(A ∪ {s}) − f(A) ≥ f(B ∪ {s}) − f(B) for any A ⊆ B ⊆ V and s ∈ V \\ B, which is equivalent to the condition that f(A ∪ P) − f(A) ≥ f(B ∪ P) − f(B) for any A ⊆ B ⊆ V and P ⊆ V \\ B. The adaptive extensions of these two conditions are adaptive submodularity and policy-adaptive submodularity, respectively. Nevertheless, there is a counterexample to the equivalence of adaptive submodularity and policy-adaptive submodularity, which is given in the supplementary materials.\nSurprisingly, many existing adaptive submodular functions in applications also satisfy policy-adaptive submodularity. In active learning, the objective functions of generalized binary search [12, 19], EC2 [20], ALuMA [21], and the maximum Gibbs error criterion [9, 8] are not only adaptive submodular but also policy-adaptive submodular. In other applications, including influence maximization and sensor placement, it is often assumed that the variables Y1, …, Yn are independent, and policy-adaptive submodularity always holds in this case. The proofs of these propositions are 
The proofs of these propositions are\ngiven in the supplementary materials.\nTo give the theoretical guarantees for the algorithms introduced in the next section, we assume\nnot only the adaptive submodularity and the adaptive monotonicity, but also the policy-adaptive\nsubmodularity. However, our theoretical analyses can still be applied to many applications.\n\n4 Algorithms\n\nIn this section we describe our proposed algorithms for each of the stream and secretary settings, and\nstate the theoretical guarantees on their performance. The full versions of pseudocodes are given in\nthe supplementary materials.\n\n4\n\n\fAlgorithm 1 AdaptiveStream algorithm & AdaptiveSecretary algorithm\nInput: A set function f : 2V (cid:2)Y ! R(cid:21)0 and a prior distribution p(\u03d5) such that (f; p) is policy-\nadaptive submodular and adaptive monotone. The number of items in the entire stream n 2 Z>0.\nA budget k 2 Z>0. Randomly permuted stream of the items, denoted by (s1;(cid:1)(cid:1)(cid:1) ; sn).\n{\nselecting the item of the largest expected marginal gain (AdaptiveStream)\napplying the classical secretary algorithm\nObserve the state y of item s and let l := l(cid:0)1 [ f(s; y)g.\n\nOutput: Some observations k (cid:18) V (cid:2) Y such that j kj (cid:20) k.\n1: Let 0 := \u2205.\n2: for each segment Sl = fsi j (l (cid:0) 1)n=k < i (cid:20) ln=kg do\n3:\n\n(AdaptiveSecretary)\n\nSelect an item s out of Sl by\n\n4:\n5: return k as the solution\n\n4.1 Algorithm for the Stream Setting\n\nk\n\nThe main idea of our proposed method is simple: divide the entire stream into k segments and select\nthe best item from each one. For simplicity, we consider the case where n is a multiple integer of\nk. If n is not, we can add k\u2308 n\n\u2309 (cid:0) n dummy items with no bene\ufb01t and prove the same guarantee.\nOur algorithm \ufb01rst divides the item sequence s1;(cid:1)(cid:1)(cid:1) ; sn into Sl = fsi j (l (cid:0) 1)n=k < i (cid:20) ln=kg\nfor l = 1;(cid:1)(cid:1)(cid:1) ; k. 
In each segment, the algorithm selects the item of the largest expected marginal gain, that is, argmax{Δ(s | ψ_{l−1}) | s ∈ Sl}, where ψ_{l−1} is the partial realization obtained before the lth segment. This can be implemented with only O(1) space by storing only the item of maximal expected marginal gain seen so far in the current segment. We provide the theoretical guarantee on the performance of this algorithm by utilizing the policy-adaptive submodularity of the objective function.\nTheorem 4.1. Suppose f : 2^(V×Y) → R≥0 is policy-adaptive submodular and adaptive monotone w.r.t. a prior p(φ). Assume the items come sequentially in a random order. For any policy π such that |ψ(π, φ)| ≤ k holds for all φ, AdaptiveStream selects k items using O(1) space and achieves at least 0.16 times the expected total gain of π in expectation.\n\n4.2 Algorithm for the Secretary Setting\n\nThough our proposed algorithm for the secretary setting is similar in its approach to the one for the stream setting, it is impossible to select the item of maximal expected marginal gain from each segment in the secretary setting. We therefore use the classical secretary algorithm [13] as a subroutine to obtain the maximal item with at least some constant probability. Within each segment, the classical secretary algorithm lets the first ⌊n/(ek)⌋ items pass and then selects the first item whose value is larger than that of all items seen so far; the probability that this subroutine selects the item of the largest expected marginal gain in the segment is at least 1/e. This algorithm can be viewed as an adaptive version of the algorithm for the monotone submodular secretary problem [3]. We give a guarantee similar to the one for the stream setting.\nTheorem 4.2. Suppose f : 2^(V×Y) → R≥0 is policy-adaptive submodular and adaptive monotone w.r.t. a prior p(φ). Assume the items come sequentially in a random order. 
For any policy π such that |ψ(π, φ)| ≤ k holds for all φ, AdaptiveSecretary selects at most k items and achieves at least 0.08 times the expected total gain of π in expectation.\n\n5 Overview of Theoretical Analysis\n\nIn this section we briefly describe the proofs of Theorems 4.1 and 4.2 and compare our techniques with the previous work. The full proofs are given in the supplementary materials.\nThe methods used in the proofs of both theorems are almost the same. They consist of two steps: in the first step, we bound the expected marginal gain of each item, and in the second step, we take the summation of the one-step marginal gains and derive the overall bound for the algorithms. Though the techniques used in the second step are taken from the previous work [3], the first step contains several novel techniques.\nLet Δi be the expected marginal gain of the item picked from the ith segment Si. First we bound it from below by the difference between the optimal pool-based policy π*_T for selecting k items from T and the policy π^σ_{i−1} that encodes the algorithm up to the (i − 1)th step under the permutation σ in which the items arrive. In the non-adaptive setting, the items of the optimal set are distributed among the segments uniformly at random, so we can evaluate Δi by considering whether Si contains an item included in the optimal set [3]. In the adaptive setting, on the other hand, it is difficult to reason about how π*_T is distributed among the unarrived items, because the policy is closely related not only to the contained items but also to the order of the items. We therefore compare Δi and the marginal gain of π*_T directly. With the adaptive monotonicity, we obtain Δi ≥ (1 − exp(−k/(k − i + 1)))(f_avg(π*_T) − f_avg(π^σ_{i−1}))/k, where f_avg(π) = E_Φ[f(ψ(π, Φ))].\nNext we bound f_avg(π*_T) with the optimal pool-based policy π*_V that selects k items from V. For the non-adaptive setting, we can apply a widely-used lemma proved by Feige, Mirrokni, and Vondrák [15]. This lemma provides a bound on the expected value of a randomly deleted subset. To extend this lemma to the adaptive setting, we define a partially deleted policy tree, the grafted policy, and prove an adaptive version of the lemma using the policy-adaptive submodularity. From this lemma we obtain the bound E_σ[f_avg(π*_T)] ≥ (k − i + 1)f_avg(π*_V)/k. We also provide an example showing that adaptive submodularity is not enough to prove this lemma.\nSumming the bounds on the one-step expected marginal gains up to the lth step (l is specified in the full proof so as to optimize the resulting guarantees), we conclude that our proposed algorithms achieve a constant-factor approximation in comparison to the optimal pool-based policy. Though AdaptiveSecretary is an adaptive version of an existing algorithm, our resulting constant factor is a little worse than the original (1 − 1/e)/7 due to the above new analyses.\n\n6 Experiments\n\n6.1 Experimental Setting\n\nWe conducted experiments on budgeted active learning in the following three settings: the pool-based, stream, and secretary settings. For each setting, we compare two kinds of methods: one based on policy-adaptive submodularity and the other based on uncertainty sampling as a baseline. Uncertainty sampling is the most widely-used approach in applications. 
Selecting random instances, which we call random, is also implemented as another baseline that can be used in every setting.\nWe select ALuMA [21] out of several pool-based methods based on adaptive submodularity, and convert it to the stream and secretary settings with AdaptiveStream and AdaptiveSecretary; we call the resulting methods stream submodular and secretary submodular, respectively. For comparison, we also implement the original pool-based method, which we call pool submodular. Though ALuMA is designed for the noiseless case, there is a modification that makes its hypothesis sampling more noise-tolerant [7], which we employ. The number of hypotheses sampled at each time is set to N = 1000 in all settings.\nFor the pool-based setting, uncertainty sampling is widely known as a generic and easy-to-implement heuristic in many applications. It selects the most uncertain instance, i.e., the instance that is closest to the current linear separator. In contrast, there is no standard heuristic for the stream and secretary settings. We apply the same conversion as in AdaptiveStream and AdaptiveSecretary to the pool-based uncertainty sampling method, i.e., in the stream setting, selecting the most uncertain instance from the segment at each step, and in the secretary setting, running the classical secretary algorithm to select the most uncertain instance with probability at least 1/e. An approach similar to the stream-setting one is used in some applications [26]. In every setting, we first randomly select 10 instances for the initial training of a classifier and after that select k − 10 instances with each method. We use a linear SVM trained on the instances labeled so far to judge the uncertainty. We call these methods pool uncertainty, stream uncertainty, and secretary uncertainty, respectively, and use them as baselines.\nWe conducted experiments on two benchmark datasets, WDBC3 and MNIST4. 
The WDBC dataset contains 569 instances, each of which consists of 32-dimensional features of cells and their diagnosis results.\n\n3https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)\n4http://yann.lecun.com/exdb/mnist/\n\nFigure 2: Experimental results. (a) WDBC dataset, error; (b) MNIST dataset, error; (c) WDBC dataset, convergence; (d) MNIST dataset, convergence.\n\nFrom the MNIST dataset, the dataset of handwritten digits, we extract 14780 images of the two classes, 0 and 1, so as to consider a binary classification problem, and apply PCA to reduce the dimensionality from 784 to 10. We standardize both datasets so that the values of each feature have zero mean and unit variance.\nWe evaluate the performance over 100 trials, where each time the order in which the instances arrive is generated randomly. For all the methods, we calculate the error rate by training a linear SVM with the obtained labeled instances and testing on the entire dataset.\n\n6.2 Experimental Results\n\nFigures 2(a)(b) illustrate the average error rate achieved by each method with budgets k = 30, 40, 50. Our methods stream submodular and secretary submodular outperform not only random, but also stream uncertainty and secretary uncertainty, respectively; i.e., the methods based on policy-adaptive submodularity perform better than the methods based on uncertainty sampling in each of the stream and secretary settings. Moreover, from the error bars representing the standard deviation, we can observe that our methods are more stable than the other methods.\nFigures 2(c)(d) show how the error rate decreases as labels are queried in the case of k = 50. On both datasets, we can observe that the performance of stream submodular is competitive with pool submodular.\n\n7 Related Work\n\nStream-based active learning. 
A large amount of work has been dedicated to devising algorithms for stream-based active learning (also known as selective sampling), from both the theoretical and the practical side. On the theoretical side, several bounds on the label complexity have been provided [16, 2, 4], but their interest lies in guarantees relative to passive learning, not to the optimal algorithm. On the practical side, stream-based active learning has been applied to many real-world problems such as sentiment analysis of web stream data [26], spam filtering [25], part-of-speech tagging [10], and video surveillance [23], but there is no definitive widely-used heuristic.\nOf particular relevance to our work is the method presented by Sabato and Hess [24]. They devised general methods for constructing stream-based algorithms satisfying a budget from pool-based algorithms, but their theoretical guarantees bound the length of the stream needed to emulate the pool-based algorithm, which is a large difference from our work. Das et al. [11] designed an algorithm for adaptively collecting water samples, referring to the submodular secretary problem, but they focused on applications to marine ecosystem monitoring and did not give any theoretical analysis of its performance.\nAdaptive submodular maximization. The framework of adaptive submodularity, an adaptive counterpart of submodularity, was established by Golovin and Krause [19]. 
It provides a simple greedy algorithm with near-optimal guarantees for several adaptive real-world problems. In particular, it achieves remarkable success in pool-based active learning. For the noiseless case, Golovin and Krause [19] described the generalized binary search algorithm [12] as the greedy algorithm for some adaptive submodular function, and improved its approximation factor. Golovin et al. [20] provided an algorithm for Bayesian active learning with noisy observations by reducing it to the equivalence class determination problem. There have also been several studies on adaptive submodular maximization in other settings, for example, selecting multiple instances at the same time before observing their states [7], guessing an unknown prior distribution in the bandit setting [18], and maximizing non-monotone adaptive submodular functions [22].\nSubmodular maximization in the stream and secretary settings. Submodular maximization in the stream setting, called streaming submodular maximization, has been studied under several constraints. Badanidiyuru et al. [1] provided a (1/2 − ε)-approximation algorithm that can be executed in O(k log k) space under a cardinality constraint. For more general constraints, including matching and multiple-matroid constraints, Chakrabarti and Kale [5] proposed constant-factor approximation algorithms. Chekuri et al. [6] devised algorithms for non-monotone submodular functions.\nMuch effort has also been devoted to submodular maximization in the secretary setting, called the submodular secretary problem, under various constraints. Bateni et al. [3] first specified the problem and provided algorithms for both the monotone and non-monotone submodular secretary problems under several constraints, one of which our methods are based on. Feldman et al. 
[14] improved the constant factors of the theoretical guarantees for the monotone cases.\n\n8 Concluding Remarks\n\nIn this paper, we investigated stream-based active learning with a budget constraint from the viewpoint of adaptive submodular maximization. To tackle this problem, we introduced the adaptive stochastic maximization problem in the stream and secretary settings, which can formalize stream-based active learning. We provided a new class of objective functions, policy-adaptive submodular functions, and showed that this class contains many utility functions that have been used in pool-based active learning and other applications. AdaptiveStream and AdaptiveSecretary, which we proposed in this paper, are simple algorithms guaranteed to be constant-factor competitive with the optimal pool-based policy. We empirically demonstrated their performance by applying our algorithms to the budgeted stream-based active learning problem, and our experimental results indicate their effectiveness compared to the existing methods.\nThere are two natural directions for future work. One is exploring the possibilities of the concept of policy-adaptive submodularity; by studying the nature of this class, we may obtain theoretical insight into other problems. The other is further developing the practical aspects of our results. In real-world problems, it sometimes happens that the items do not arrive in a random order. For example, in sequential adaptive sensor placement [11], the order of items is restricted by some transportation constraint. In this setting our guarantees do not hold and another algorithm is needed. In contrast to 
In contrast to the non-adaptive setting, it seems much more difficult to design a constant factor approximation algorithm even in the stream setting, because the full information about each item is revealed only when its state is observed, and memory is not as powerful as in the non-adaptive setting.

Acknowledgments

The second author is supported by a Grant-in-Aid for Scientific Research on Innovative Areas, Exploration of nanostructure-property relationships for materials innovation.

References
[1] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 671–680, 2014.
[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 65–72, 2006.
[3] M. Bateni, M. Hajiaghayi, and M. Zadimoghaddam. Submodular secretary problem and extensions. ACM Transactions on Algorithms (TALG), 9(4):32, 2013.
[4] A. Beygelzimer, S. Dasgupta, and J. Langford. Importance-weighted active learning. Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 49–56, 2009.
[5] A. Chakrabarti and S. Kale. Submodular maximization meets streaming: Matchings, matroids, and more. Mathematical Programming Series B, 154(1), pp. 225–247, 2015.
[6] C. Chekuri, S. Gupta, and K. Quanrud. Streaming algorithms for submodular function maximization. Automata, Languages, and Programming (ICALP), pp. 318–330, 2015.
[7] Y. Chen and A. Krause. Near-optimal batch mode active learning and adaptive submodular optimization. Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 160–168, 2013.
[8] N. V. Cuong, W. S. Lee, and N. Ye.
Near-optimal adaptive pool-based active learning with general loss. Uncertainty in Artificial Intelligence (UAI), 2014.
[9] N. V. Cuong, W. S. Lee, N. Ye, K. M. A. Chai, and H. L. Chieu. Active learning for probabilistic hypotheses using the maximum Gibbs error criterion. Advances in Neural Information Processing Systems (NIPS), pp. 1457–1465, 2013.
[10] I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. Proceedings of the 12th International Conference on Machine Learning (ICML), pp. 150–157, 1995.
[11] J. Das, F. Py, J. B. J. Harvey, J. P. Ryan, A. Gellene, R. Graham, D. A. Caron, K. Rajan, and G. S. Sukhatme. Data-driven robotic sampling for marine ecosystem monitoring. The International Journal of Robotics Research, 34(12), pp. 1435–1452, 2015.
[12] S. Dasgupta. Analysis of a greedy active learning strategy. Advances in Neural Information Processing Systems (NIPS), pp. 337–344, 2004.
[13] E. B. Dynkin. The optimum choice of the instant for stopping a Markov process. Soviet Math. Dokl., 4, pp. 627–629, 1963.
[14] M. Feldman, J. S. Naor, and R. Schwartz. Improved competitive ratios for submodular secretary problems. Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX-RANDOM), pp. 218–229, 2011.
[15] U. Feige, V. Mirrokni, and J. Vondrák. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4), pp. 1133–1153, 2011.
[16] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28, pp. 133–168, 1997.
[17] S. Fujishige. Submodular Functions and Optimization, Second Edition. Annals of Discrete Mathematics, Vol. 58, Elsevier, 2005.
[18] V. Gabillon, B. Kveton, Z. Wen, B. Eriksson, and S. Muthukrishnan. Adaptive submodular maximization in bandit setting.
Advances in Neural Information Processing Systems (NIPS), pp. 2697–2705, 2013.
[19] D. Golovin and A. Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR), 42, pp. 427–486, 2011.
[20] D. Golovin, A. Krause, and D. Ray. Near-optimal Bayesian active learning with noisy observations. Advances in Neural Information Processing Systems (NIPS), pp. 766–774, 2010.
[21] A. Gonen, S. Sabato, and S. Shalev-Shwartz. Efficient active learning of halfspaces: An aggressive approach. The Journal of Machine Learning Research (JMLR), 14(1), pp. 2583–2615, 2013.
[22] A. Gotovos, A. Karbasi, and A. Krause. Non-monotone adaptive submodular maximization. Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1996–2003, 2015.
[23] C. C. Loy, T. M. Hospedales, T. Xiang, and S. Gong. Stream-based joint exploration-exploitation active learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[24] S. Sabato and T. Hess. Interactive algorithms: From pool to stream. Proceedings of the 29th Annual Conference on Learning Theory (COLT), pp. 1419–1439, 2016.
[25] D. Sculley. Online active learning methods for fast label-efficient spam filtering. Proceedings of the Fourth Conference on Email and Anti-Spam (CEAS), 2007.
[26] J. Smailović, M. Grčar, N. Lavrač, and M. Žnidaršič. Stream-based active learning for sentiment analysis in the financial domain. Information Sciences, 285, pp. 181–203, 2014.