{"title": "Efficient nonmyopic batch active search", "book": "Advances in Neural Information Processing Systems", "page_first": 1099, "page_last": 1109, "abstract": "Active search is a learning paradigm for actively identifying as many members of a given class as possible. A critical target scenario is high-throughput screening for scientific discovery, such as drug or materials discovery. In these settings, specialized instruments can often evaluate \\emph{multiple} points simultaneously; however, all existing work on active search focuses on sequential acquisition. We bridge this gap, addressing batch active search from both the theoretical and practical perspective. We first derive the Bayesian optimal policy for this problem, then prove a lower bound on the performance gap between sequential and batch optimal policies: the ``cost of parallelization.'' We also propose novel, efficient batch policies inspired by state-of-the-art sequential policies, and develop an aggressive pruning technique that can dramatically speed up computation. We conduct thorough experiments on data from three application domains: a citation network, material science, and drug discovery, testing all proposed policies (14 total) with a wide range of batch sizes. Our results demonstrate that the empirical performance gap matches our theoretical bound, that nonmyopic policies usually significantly outperform myopic alternatives, and that diversity is an important consideration for batch policy design.", "full_text": "Ef\ufb01cient nonmyopic batch active search\n\nShali Jiang\nCSE, WUSTL\n\nSt. Louis, MO 63130\njiang.s@wustl.edu\n\nGustavo Malkomes\n\nCSE, WUSTL\n\nSt. Louis, MO 63130\n\nluizgustavo@wustl.edu\n\nBenjamin Moseley\n\nTepper School of Business, CMU and\n\nRelational AI\n\nPittsburgh, PA 15213\n\nmoseleyb@andrew.cmu.edu\n\nMatthew Abbott\nCSE, WUSTL\n\nSt. Louis, MO 63130\nmbabbott@wustl.edu\n\nRoman Garnett\nCSE, WUSTL\n\nSt. 
Louis, MO 63130\ngarnett@wustl.edu\n\nAbstract\n\nActive search is a learning paradigm for actively identifying as many members of\na given class as possible. A critical target scenario is high-throughput screening\nfor scienti\ufb01c discovery, such as drug or materials discovery. In these settings,\nspecialized instruments can often evaluate multiple points simultaneously; however,\nall existing work on active search focuses on sequential acquisition. We bridge this\ngap, addressing batch active search from both the theoretical and practical perspec-\ntive. We \ufb01rst derive the Bayesian optimal policy for this problem, then prove a\nlower bound on the performance gap between sequential and batch optimal policies:\nthe \u201ccost of parallelization.\u201d We also propose novel, ef\ufb01cient batch policies inspired\nby state-of-the-art sequential policies, and develop an aggressive pruning technique\nthat can dramatically speed up computation. We conduct thorough experiments on\ndata from three application domains: a citation network, material science, and drug\ndiscovery, testing all proposed policies with a wide range of batch sizes. Our results\ndemonstrate that the empirical performance gap matches our theoretical bound,\nthat nonmyopic policies usually signi\ufb01cantly outperform myopic alternatives, and\nthat diversity is an important consideration for batch policy design.\n\n1\n\nIntroduction\n\nIn active search (AS), we seek to sequentially inspect data to discover as many members of a desired\nclass as possible with a limited budget. Formally, suppose we are given a \ufb01nite domain of n elements\nX = {xi}n\ni=1, among which there is a rare, valuable subset R \u2282 X . We call the members of this class\ntargets or positive items. The identities of the targets are unknown a priori, but can be determined by\nquerying an expensive oracle that can compute y = 1{x \u2208 R} for any x \u2208 X . 
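To make the setup concrete, the search problem above can be sketched in a few lines of code. This is a toy illustration only; the names (`oracle`, `utility`) and the tiny domain are our own, not part of any implementation described in the paper.

```python
# Illustrative active search setup: a finite domain, a hidden rare target
# subset R, an expensive labeling oracle, and the utility u(D) = sum of labels.

n = 10
X = list(range(n))          # the finite domain {x_1, ..., x_n}
R = {2, 7}                  # hidden target subset (unknown to the policy)

def oracle(x):
    """Expensive labeling oracle: y = 1{x in R}."""
    return int(x in R)

def utility(D):
    """u(D): number of targets among the queried items."""
    return sum(y for _, y in D)

# A policy with budget T queries T items and is scored by the utility;
# here, a naive policy that simply queries the first T items.
T = 4
D = [(x, oracle(x)) for x in X[:T]]
```

A good policy would use the revealed labels to steer later queries toward likely targets; the naive policy above ignores them.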
Given a budget T on the number of queries we can provide the oracle, we wish to design a policy that sequentially queries items {x_t} = {x_1, x_2, ..., x_T} to maximize the number of targets identified, ∑_t y_t. Many real-world problems can be naturally posed in terms of active search; drug discovery [7, 17, 18], fraud detection, and product recommendation [21] are a few examples.

Previous work [6] has developed Bayesian optimal policies for active search with a natural utility function. Not surprisingly, this policy is computationally intractable, requiring cost that grows exponentially with the horizon. Therefore the optimal policy must be approximated in practice. Several approximation schemes for active search have been proposed and studied, including simple myopic lookahead approximations [6]. Jiang et al. [12] recently proposed an efficient, nonmyopic search (ENS) policy, and demonstrated that this policy yields remarkable empirical performance on search problems from various domains, including drug and materials discovery.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Although these policies are empirically effective, there is a large theoretical gap between the performance of the optimal policy and any efficient approximation: Jiang et al. [12] have shown that it is impossible to α-approximate the expected performance of the optimal policy for any constant α in polynomial time.

These previous investigations all focused on sequential active search (SAS), where we query one point at a time. However, in many real applications, we can query a batch of multiple points simultaneously. For example, modern high-throughput screening technologies for drug discovery can process microwell plates containing 96+ compounds at a time. No policies designed for this batch active search setting are currently available. 
Previous work has produced batch policies\nfor different active learning or search settings, which we will discuss in Section 4.\nWe investigate batch active search (BAS) from both the theoretical and practical perspectives. We\n\ufb01rst derive the Bayesian optimal policy for BAS, and show that its time complexity in general is\ndauntingly high, except in the trivial one-step (myopic) case. We then prove an asymptotic lower\nbound on the expected performance gap between the optimal sequential and batch policies.\nNext we consider practical concerns such as effective policy design. We generalize the recently\nproposed ef\ufb01cient nonmyopic sequential policy (ENS) from [12] to the batch setting. The nonmyopia\nof ENS is automatically inherited, but ef\ufb01ciency is lost as the batch version involves combinatorial\noptimization (i.e., set function maximization). We propose and study two ef\ufb01cient approximation\nstrategies. The \ufb01rst strategy is a sequential simulation, where we simulate sequential ENS to construct\na batch using a \ufb01ctional labeling oracle. The second strategy is greedily maximizing the marginal\ngain to our batch ENS score, motivated by our conjecture that the inherent batch score is submodular.\nWe prove that sequential simulation of the one-step Bayesian optimal policy with a pessimistic oracle\n(i.e., one that always outputs negative labels) near-optimally maximizes the probability that at least\none point in the batch is positive. This theoretical support of pessimism is in contrast to other settings\nsuch as Bayesian optimization, where pessimism has been used as a heuristic for batch policies.\nWe also improve the pruning techniques suggested by Jiang et al. [12] considerably to reduce the\ncomputational overhead of our proposed policies in practice. 
We demonstrate a connection with lazy\nevaluation [5] and show that our pruning strategy can provide a speedup of over 50 times in a drug\ndiscovery setting.\nFinally, we conduct thorough experiments on data from three domains: a citation network, material\nscience, and drug discovery. In total we study 14 policies: the one-step optimal batch policy, 12\nsequential simulation policies (three sequential policies combined with four \ufb01ctional oracles), and\ngreedy maximization of the batch version of ENS. We observe that ENS-based (nonmyopic) policies\nalmost always provide a signi\ufb01cant improvement in performance. Two policies are particularly\nnotable: sequential simulation of ENS with a pessimistic oracle and greedy maximization of batch\nENS. The latter is shown to be more robust for larger batch sizes.\n\n2 Bayesian optimal batch active search\n\nour preference over different datasets D = (cid:8)(xi, yi)(cid:9) through a natural utility: u(D) = (cid:80) yi,\n\nWe begin our investigation by establishing the optimal policy for batch active search using the\nframework of Bayesian decision theory. To cast batch active search into this framework, we express\nwhich simply counts the number of targets in D. Occasionally we will use the notation u(Y )\nfor u(D) when D = (X, Y ). We now consider the problem of sequentially choosing a set of\nT (a given budget) points D with the goal of maximizing u(D). In the batch setting, for each\nquery we must select a batch of b points and will then observe all their labels at the same time.\nWe use Xi = {xi,1, xi,2, . . . xi,b} to denote a batch of points chosen during the ith iteration, and\nk=1 to denote the\nobserved data after i \u2264 t batch queries, where t = (cid:100)T /b(cid:101).\nWe assume a probability model P is given, providing the posterior marginal probability Pr(y | x,D)\nfor any point x \u2208 X and observed dataset D. 
At iteration i + 1 (given observations Di), the Bayesian\noptimal policy chooses a batch Xi+1 maximizing the expected utility at termination, recursively\nassuming optimal continued behavior:\n\nYi = {yi,1, yi,2, . . . yi,b} the corresponding labels. We use Di = (cid:8)(Xk, Yk)(cid:9)i\n\nXi+1 = arg max\n\nX\n\n2\n\nE(cid:2)u(Dt \\ Di) | X,Di\n\n(cid:3).\n\n(1)\n\n\fE(cid:2)u(Dt \\ Dt\u22121) | X,Dt\u22121\n\n(cid:3) = E\n\nNote that the additive nature of our chosen utility allows us to ignore the utility of the already gathered\ndata in the expectation.\nTo derive the expected utility, we adopt the standard technique of backward induction, as used by for\nexample Garnett et al. [6] to analyze the sequential case. The base case is when only one batch is left\n(i = t \u2212 1). The expected utility resulting from a proposed \ufb01nal batch X is then\n\nY |X,Dt\u22121\n\n(2)\nwhere E\nY |X,Di is the expectation over the joint posterior distribution of Y (the labels of X) con-\nditioned on Di. 
In this case, designing the optimal batch (1) by maximizing the expected utility is\ntrivial: we select the points with the highest probabilities of being targets, re\ufb02ecting pure exploitation.\nThis optimal batch can then be found in O(n log b) time using, e.g., min-heap of size b.\nIn general, when i \u2264 t\u2212 1, the expected terminal utility resulting from choosing a batch X at iteration\ni + 1 and acting optimally thereafter can be written as a Bellman equation as follows:\n\nx\u2208X Pr(y = 1 | x,Dt\u22121),\n\n(cid:2)u(Y )(cid:3) =(cid:80)\n\n(cid:104)\n\nmaxX(cid:48) E(cid:2)u(Dt\\Di+1) | X(cid:48),Di+1\n\n(cid:3)(cid:105)\n\nE(cid:2)u(Dt\\Di) | X,Di\n\n(cid:3) =(cid:80)\n\nx\u2208X Pr(y = 1 | x,Di)+E\n\nY |X,Di\n\n,\n(3)\nwhere the \ufb01rst term represents the expected utility resulting immediately from the points in X, and\nthe second part is the expected future utility from the following iterations.\nThe most interesting aspect of the Bayesian optimal policy is that these immediate and future reward\ncomponents in (3) can be interpreted as automatically balancing exploitation (immediate utility) and\nexploration (expected future utility given the information revealed by the present batch).\nHowever, without further assumptions on the joint label distribution P, exact maximization of (3)\nrequires enumerating the whole search tree of the form Di \u2192 Xi+1 \u2192 Yi+1 \u2192 \u00b7\u00b7\u00b7 \u2192 Xt \u2192 Yt. The\n\nbranching factor of the X layers is(cid:0)n\n(cid:1), as we must enumerate all possible batches. The branching\ndaunting O(cid:0)(2n)b(t\u2212i)(cid:1). The running time analysis in [6] is a special case of this result where b = 1.\n\nfactor of the Y layers is 2b, as we must enumerate all possible labelings of a given batch. 
So the total\ncomplexity of a na\u00efve implementation computing the optimal policy at iteration i + 1 would be a\n\nb\n\nThe optimal policy is clearly computationally infeasible, so we must resort to suboptimal policies\nto proceed in practice. One reasonable and practical alternative is to adopt a myopic lookahead\napproximation to the optimal policy. A greedy (one-step lookahead) approximation, which always\nmaximizes the expected marginal gain in (2), constructs each batch by selecting the points with\nhighest probability of being a target. We will refer to this policy as greedy-batch, and this will serve\nas a natural baseline batch policy for active search.\n\n2.1 Adaptivity gap\n\nFor purely sequential policies (i.e., b = 1), every point is chosen based on a model informed by all\nprevious observations. However, for batch policies (b > 1), points are typically chosen with less\ninformation available. For example, in the extreme case when b = T , every point in our budget\nmust be chosen before we have observed anything, hence we might reasonably expect our search\nperformance to suffer. Clearly there must be an inherent cost to batch policies compared to sequential\npolicies due to a loss of adaptivity. How much is this cost?\nWe have proven the following lower bound on the inherent \u201ccost of parallelism\u201d in active search:\nTheorem 1. There exist active search instances with budget T , such that OPT1\nOPTx is the expected number of targets found by the optimal batch policy with batch size x \u2265 1.\nOPTb\n\n(cid:1), where\n\nis \u2126(cid:0) b\n\nlog T\n\nProof sketch. We construct a special type of active search instance where the location of a large\ntrove of positives is encoded by a binary tree, and a search policy must take the correct path through\nthe tree to decode a treasure map pointing to these points. 
We design the construction such that a sequential policy can easily identify the correct path by walking down the tree directed by the labels of queried nodes. A batch policy must waste many queries decoding the map, as the correct direction is only revealed after constructing an entire batch. We show that even the optimal batch policy has a very low probability of identifying the location of the hidden targets quickly enough, so that the expected utility is much less than that of the optimal sequential policy. A detailed proof is given in the supplementary material.

Thus the expected performance ratio between optimal sequential and batch policies, also known as the adaptivity gap in the literature [1], is lower bounded linearly in batch size. This theorem is not only of theoretical interest: it can also provide practical guidance on choosing batch sizes. Indeed, in drug discovery, modern high-throughput screening technologies provide many choices for batch sizes; understanding the inherent loss from choosing larger batch sizes provides valuable information regarding the tradeoff between efficiency and cost.

3 Efficient nonmyopic approximations

The greedy-batch policy is myopic in the sense that each decision represents pure exploitation: the future reward is always assumed to be zero, and the remaining budget is not taken into consideration. Here we will generalize a recently proposed nonmyopic sequential active search policy, ENS [12], to the batch setting and propose two techniques to approximately compute it.

Our proposed adaptation of ENS to the batch setting can be motivated with the following question: how many targets would we expect to find if, after selecting the current batch, we spent the entire remaining budget simultaneously? If this were the case, then the maximum future utility could be computed without recursion:

    E[u(D_t \ D_i) | X, D_i] = ∑_{x ∈ X} Pr(y = 1 | x, D_i) + E_{Y | X, D_i}[ max_{X' : |X'| = T−b−|D_i|} E[u(Y') | X', D_i, X, Y] ].    (4)

Note that the optimal final action simply selects the points with the highest T − b − |D_i| probabilities, allowing the expected future reward to be computed exactly and efficiently. We may use this insight to rewrite (4) as (using f(X | D_i) as shorthand for the expected utility from selecting X given D_i):

    f(X | D_i) = ∑_{x ∈ X} Pr(y = 1 | x, D_i) + E_{Y | X, D_i}[ ∑′_{T−b−|D_i|} Pr(y′ = 1 | x′, D_i, X, Y) ].    (5)

Here we have adopted the notation ∑′_s from [12] to denote the sum of the top s probabilities over the unlabeled points, x′ ∈ X \ (D_i ∪ X). Jiang et al. [12] gave a further interpretation of the ENS policy as approximating the optimal expected utility (3) by assuming that the remaining unlabeled points after this batch are conditionally independent, so that there is no need to recursively enumerate the search tree. This assumption might seem unrealistic at first, but when many well-spaced points are observed, we note they might approximately "D-separate" the remaining unlabeled points. Further, ENS naturally encourages the selection of well-spaced points (targeted exploration) in the initial stage of the search [12].

The nonmyopia of (5) is automatically inherited in generalizing from the sequential to the batch setting due to explicit budget awareness. Unfortunately, the efficiency of the sequential ENS policy is not preserved. Direct maximization of (5) still requires combinatorial search over all subsets of size b. Moreover, to evaluate a given batch, we need to enumerate all its possible labelings (2^b in total) to compute the expectation in the second term. Accounting for the cost of conditioning and summing the top probabilities, the total complexity would be O((2n)^b n log T).

We propose two strategies to tackle these computational problems below.

3.1 Sequential simulation

The cost of computing the proposed batch policy has exponential dependence on the batch size b > 1. To avoid this, our first idea is to reduce BAS to SAS (b = 1). We select points one at a time to add to a batch by maximizing the sequential ENS score (i.e., (5) with b = 1). We then use some fictional labeling oracle L : X → {0, 1} to simulate its label and incorporate the observation into our dataset. We repeat this procedure until we have selected b points. Note that we could use this basic construction replacing ENS with any other sequential policy π, such as the one-step or two-step Bayesian optimal policies [6].

We will see that the behavior of the fictional labeling oracle has a large influence on the behavior of the resulting search policies. Here we will consider four fictional oracles: (1) sampling, where we randomly sample a label from its marginal distribution; (2) most-likely, where we assume the most-likely label; (3) pessimistic, where we always believe all labels are negative; and (4) optimistic, where we always believe all labels are positive.

Sequential simulation is a common heuristic in similar settings like batch Bayesian optimization, as we will discuss in detail in the next section. Here we provide some mathematical rationale for this procedure in a special case: the one-step optimal (greedy) search policy combined with the pessimistic oracle. This proposition is inspired by the work of Wang [23].

Proposition 1. 
The batch constructed by sequentially simulating the greedy active search policy with\na pessimistic oracle near-optimally maximizes the probability that at least one of the points in the\nbatch is positive, assuming that marginal target probabilities of unlabeled points are nonincreasing\nwhen conditioning on a negative observation.\n\nProof sketch. We show that the probability of a batch having at least one positive is a monotone\nsubmodular set function, and that sequentially simulating the one-step policy with the pessimistic\noracle equivalently maximizes the marginal gain of this function. Therefore, it is near-optimal [16].\nSee the supplementary materials for a formal proof.\nNote the assumption in this proposition simply means the probability model does not involve negative\nlabel correlations; the k-nn model used in our experiments satis\ufb01es this assumption.\nWith this result, it is not hard to see that sequentially simulating the greedy policy with an optimistic\noracle greedily maximizes the probability that all points in the batch are positive. In this case,\nhowever, the corresponding set function is not submodular so we don\u2019t know if there are optimality\nguarantees.\nNote we are not claiming that the objective of \ufb01nding at least one positive serves as a good basis\nfor batch active search; actually as we will see in our experiments, this is often much worse than\nother nonmyopic batch policies we propose. However, we believe this result provides theoretical\ninsight that could shed light on other batch policies under similar settings. For example, this policy\ncan be considered as an active search counterpart of a batch version of probability of improvement for\nBayesian optimization [13].\n\n3.2 Greedy approximation\n\nOur second strategy is motivated by our conjecture that (5) is a monotone submodular function under\nreasonable assumptions. 
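For context on why submodularity would help: greedy marginal-gain maximization of a monotone submodular set function is within a (1 − 1/e) factor of the best size-b subset [16]. A generic sketch of that greedy loop follows; the `score` function in the test is a purely illustrative coverage function, not the batch-ENS objective.

```python
def greedy_max(score, candidates, b):
    """Greedy batch construction: repeatedly add the point with the largest
    marginal gain, score(X + {x}) - score(X).  For monotone submodular
    score functions this achieves a (1 - 1/e) approximation guarantee."""
    batch = []
    for _ in range(b):
        best = max((x for x in candidates if x not in batch),
                   key=lambda x: score(batch + [x]) - score(batch))
        batch.append(best)
    return batch
```

Instantiating `score` with the batch-ENS objective (5) recovers the marginal-gain construction of (6)–(7).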
If that is the case, then again a greedy batch construction returns a batch with near-optimal score [16]. We therefore propose to use a greedy algorithm to sequentially construct the batch by maximizing the marginal gain. That is, we begin with an empty batch X = ∅. We then sequentially add b points, each chosen to maximize the marginal gain:

    x = arg max_x Δf(x | X),    (6)

where

    Δf(x | X) = f(X ∪ {x} | D_i) − f(X | D_i).    (7)

When b is large, this procedure is still expensive to compute due to the expectation term in (5), requiring O(2^b) operations to compute exactly. Here we approximate the expectation using Monte Carlo sampling with a small set of samples of the labels. Specifically, given a batch of points X, we approximate (5) with samples S = {Ỹ : Ỹ ∼ Y | X, D_i}:

    f(X | D_i) ≈ ∑_{x ∈ X} Pr(y = 1 | x, D_i) + (1/|S|) ∑_{Y ∈ S} [ ∑′_{T−b−|D_i|} Pr(y′ = 1 | x′, D_i, X, Y) ].    (8)

We will call the batch policy described above batch-ENS. Note that batch-ENS using one sample of the labels in a batch is similar to sequential simulation of ENS with the sampling oracle, though the two policies are motivated in different ways.

3.3 Implementation and pruning

We adopt k-nn as our probability model, so the tricks described in Section 3.2 of [12] can be adopted. The time complexity per iteration for sequential simulation of ENS is O(n(log n + m log m + T)), where n is the total number of points, m is the maximum number of points that can be influenced by any point under the k-nn model, and T is the total budget. 
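For concreteness, the marginal posterior of a k-nn model can be sketched as follows. This is a hedged illustration: the pseudocount prior `gamma` and the dictionary-based interface are our own assumptions, not the exact model of [12].

```python
def knn_posterior(x, neighbors, observed, gamma=0.1):
    """Sketch of a k-nn posterior Pr(y = 1 | x, D).

    neighbors[x] lists the k nearest neighbors of x; observed maps
    labeled points to binary labels.  gamma acts as a pseudocount
    prior (an illustrative choice, not the paper's setting).
    """
    labels = [observed[z] for z in neighbors[x] if z in observed]
    return (gamma + sum(labels)) / (1.0 + len(labels))
```

Under such a model, a point's probability changes only when one of its k neighbors is newly labeled, which is what bounds the per-iteration work by the quantity m above.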
For batch-ENS, the time complexity at each iteration is the same as for ENS for each sample, so the complexity increases linearly with the number of samples. Note that with 2^s samples, (8) can be computed exactly for the first s + 1 points; so the complexity for selecting the jth point of a batch would be O(min(2^{j−1}, 2^s) n(log n + m log m + T)).

We also improve the bounding and pruning strategy developed in [12], and our new procedure is similar in spirit to lazy evaluation [5]. On drug discovery datasets, on average over 98% of the candidate points can be pruned in each iteration, a speedup of over 50 times. Details and results regarding the effectiveness of pruning in practice can be found in the supplemental material.

4 Related work

Active search (AS) and its variants have been the subject of a great deal of recent work [6, 7, 12, 14, 15, 22, 24–27]; nevertheless, to the best of our knowledge, this is the first study of batch active search under this particular setting.

Warmuth et al. [25, 26] considered a different goal for batch active search: to find all or a given number of actives as soon as possible. Our goal, in contrast, is to find as many actives as possible within a given budget, which encourages more nonmyopic planning. Their proposed batch policy is to pick the most-likely positive points (those farthest from an SVM hyperplane), which is quite different from our more-principled approach using Bayesian decision theory. Their policy is an analog of the one-step (greedy) myopic policy in our treatment, which performs poorly as we will show in Section 5.

Active search is a specific realization of active learning (AL). 
Though highly related, AL and AS\nhave fundamentally different goals: learning an accurate model versus retrieving positive examples.\nOne might argue that AS can be reduced to AL by \ufb01rst learning the decision boundary, then just\ncollecting the predicted positive examples. However, it is often the case that the given budget is\nfar from enough for an accurate model to be learned, and we must have more-elegant approaches\nto balance exploration and exploitation. Good AL policies could perform poorly in AS: Warmuth\net al. [25] compared several variants of uncertainty sampling (arguably one of the most popular AL\npolicies) with greedy AS policies, and demonstrated that the greedy policies performed much better\nin terms of retrieving active compounds. Jiang et al. [12] showed a k-nn classi\ufb01cation model trained\non 10 times more data still retrieved signi\ufb01cantly fewer positives than a simple greedy AS policy. In\ntheir supplementary materials, they also showed that uncertainty sampling performed much worse\nthan the greedy policy given the same budget.\nBatch policies have been studied extensively in active learning [2\u20134, 11].\nIn particular, Chen\nand Krause [3] proposed an adaptive submodular objective function, and chose points greedily by\nmaximizing the marginal gain. This algorithm is similar in spirit to our batch-ENS policy, though it is\nnot known whether the batch-ENS function is submodular. They also proved a result similar in spirit\nto our Theorem 1 (also called \u201cadaptivity gap\u201d in [1]) to show that the price of parallelism is bounded\nirrespective of batch sizes. This theorem holds under the stochastic submodular maximization setting\nwhere the outcomes of variables are independent, which certainly does not apply in our case.\nAS can be considered as Bayesian optimization (BO) with binary observations and cumulative\nreward maximization on a \ufb01nite domain. 
Numerous batch BO policies have been studied [8–10, 28]. Ginsbourger et al. [8, 9] proposed q-EI, in which q points are selected simultaneously to maximize the expected improvement. They also used sequential simulation to optimize the q-EI objective, and proposed two heuristic "fictional oracles" called the Kriging believer (KB) and constant liar (CL). KB sets the label of a chosen point to its posterior mean, and CL sets the label to a chosen constant, such as the maximum, mean, or minimum of the observed values so far. This is similar to our pessimistic or optimistic oracles.

Active search is also related to the multi-armed bandit (MAB) setting if a point is considered an arm and each point can only be pulled once. In the Gaussian process (GP) bandit optimization setting, Desautels et al. [5] proposed GP-BUCB, a batch extension of the GP-UCB policy [19]. They also construct the batch by sequentially simulating the GP-UCB policy, where the values of the selected points are "hallucinated" with the posterior mean, equivalent to the Kriging believer heuristic for q-EI. A similar strategy was also adopted in [27] to identify the compounds with the top-k continuous-valued binding activities against an identified biological target. These approaches don't directly apply to our setting, where the target values are binary. In fact, Jiang et al. [12] showed that a UCB-style policy adapted to the Bernoulli setting performs worse than a myopic two-step policy on a range of problems in their supplementary material.

Table 1: Results for drug discovery data: average number of positive compounds found by the baseline uncertain-greedy batch, greedy-batch, sequential simulation, and batch-ENS policies. Each column corresponds to a batch size, and each row to a policy. Each entry is an average over 200 experiments (10 datasets by 20 experiments). The budget T is 500. 
Highlighted are the best (bold) for each batch size and those that are not significantly worse (blue italic) than the best under one-sided paired t-tests with significance level α = 0.05.

| Policy       | 1     | 5     | 10    | 15    | 20    | 25    | 50    | 75    | 100   |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| UGB          | -     | 257.6 | 257.9 | 258.3 | 250.1 | 246.0 | 218.8 | 206.2 | 172.1 |
| greedy       | 269.8 | 268.1 | 264.1 | 261.6 | 258.2 | 257.0 | 240.1 | 227.2 | 208.2 |
| ss-one-1     | 269.8 | 260.7 | 254.6 | 245.2 | 233.6 | 223.4 | 200.8 | 182.9 | 178.9 |
| ss-one-m     | 269.8 | 264.5 | 257.7 | 250.0 | 244.4 | 236.5 | 211.7 | 195.4 | 179.4 |
| ss-one-s     | 269.8 | 266.8 | 261.3 | 256.7 | 248.7 | 244.1 | 214.9 | 202.4 | 181.3 |
| ss-one-0     | 269.8 | 268.1 | 264.1 | 261.6 | 258.2 | 257.0 | 240.1 | 227.2 | 208.2 |
| ss-two-1     | 281.1 | 237.1 | 219.8 | 210.8 | 212.1 | 196.2 | 172.1 | 158.8 | 152.9 |
| ss-two-m     | 281.1 | 252.6 | 246.4 | 237.2 | 232.9 | 225.1 | 200.2 | 181.6 | 167.2 |
| ss-two-s     | 281.1 | 248.9 | 242.5 | 235.3 | 226.6 | 219.2 | 196.7 | 175.3 | 158.3 |
| ss-two-0     | 281.1 | 252.5 | 247.6 | 247.9 | 244.4 | 240.4 | 225.6 | 213.8 | 199.1 |
| ss-ENS-1     | 295.1 | 269.4 | 247.9 | 227.2 | 223.1 | 210.3 | 185.3 | 152.6 | 148.7 |
| ss-ENS-m     | 295.1 | 293.8 | 290.2 | 285.3 | 281.6 | 274.4 | 249.4 | 217.2 | 203.1 |
| ss-ENS-s     | 295.1 | 289.9 | 278.3 | 269.8 | 262.6 | 255.0 | 220.8 | 185.5 | 161.2 |
| ss-ENS-0     | 295.1 | 293.6 | 289.1 | 288.1 | 287.5 | 280.7 | 269.2 | 257.2 | 241.0 |
| batch-ENS-16 | 295.1 | 300.8 | 296.2 | 293.9 | 292.1 | 288.0 | 275.8 | 272.3 | 252.9 |
| batch-ENS-32 | 295.1 | 300.8 | 295.5 | 297.9 | 290.6 | 288.8 | 281.4 | 275.5 | 263.5 |

5 Experiments

In this section, we comprehensively compare all 14 of our proposed policies: (1) greedy-batch, coded as "greedy"; (2–13) sequential simulation, coded as "ss-P-O", where P (for policy) can be "one" (one-step), "two" (two-step), or "ENS", and O (for oracle) can be "s" (sampling), "m" (most-likely), "0" (pessimistic, i.e., always-0), or "1" (optimistic, i.e., always-1); and (14) batch-ENS. 
Suggested by one of the anonymous reviewers, we also compare these policies against another naïve baseline, which we call uncertain-greedy batch (UGB), where we build batches that simultaneously encourage exploration and exploitation by combining the most uncertain points and the highest-probability points. We use a hyperparameter r ∈ (0, 1) to control the proportion, choosing the most uncertain points for 100r% of the batch and greedy points for the remaining 100(1−r)% of the batch. We run this policy for r ∈ {0.1, 0.2, ..., 0.9}, and show the best result among them. We implement all these policies with the MATLAB active learning toolbox.¹ Following Jiang et al. [12], we consider data from three application domains: a citation network, material science, and drug discovery. Similar patterns can be found across the three domains, so we mainly present the results for 10 drug discovery datasets in the main text. Results for other datasets are detailed in the supplemental material. We use k-nearest neighbors (k-nn) with k = 100 as our probability model for the drug discovery datasets, and k = 50 for the other two datasets (following the studies in [7, 12]).

5.1 Drug discovery

We conduct our main investigation on a drug discovery application. In this application, our goal is to find chemical compounds that exhibit binding activity against a target protein. Each target protein defines an active search problem. We consider the first ten of the 120 datasets used in [7, 12] and only the ECFP4 fingerprint, which showed the best performance in those studies. These datasets share a pool of 100 000 negative compounds randomly selected from the ZINC database [20]. The number of positives in the ten datasets varies from 221 to 1024, with mean 553.

For each dataset, we start with one random initial positive seed observation and repeat the experiment 20 times. 
We test batch sizes b ∈ {5, 10, 15, 20, 25, 50, 75, 100}; we also show results for sequential search (b = 1) as a reference. The budget is set to T = 500. We test batch-ENS with 16 and 32 samples, coded as batch-ENS-16 and batch-ENS-32.

1 https://github.com/rmgarnett/active_learning

Figure 1: (a) Average performance ratio between sequential policies and batch policies, as a function of batch size, produced using averaged results in Table 1. (b) Progressive probabilities of the chosen points of greedy and batch-ENS-32, averaged over results for batch size 50 on all 10 drug discovery datasets and 20 experiments each.

We show the number of positive compounds found in Table 1, averaged over the 10 datasets and 20 experiments each, so each entry in the table is an average over 200 experiments. We highlight the best result for each batch size in boldface. We conduct a paired t-test for each other policy against the best one, and emphasize in blue italics those that are not significantly worse than the best at significance level α = 0.05.
We highlight the following observations. (1) The performance decreases as the batch size increases. (2) Nonmyopic policies are consistently better than myopic ones; in particular, batch-ENS is a clear winner. (3) For sequential simulation policies, the pessimistic oracle is almost always the best.
For batch-ENS, we find that 32 samples often perform better than 16, especially for larger batch sizes. We have run batch-ENS for b = 50 with N ∈ {2, 4, 8, 16, 32, 64} (see supplemental material), and find that performance improves considerably as the number of samples increases, although the magnitude of this improvement tends to diminish. We believe 32 label samples offer a good tradeoff between efficiency and accuracy for b = 50.

5.2 Discussion

We now discuss our observations in more detail.
First, we see that all our proposed policies perform better than the heuristic uncertain-greedy batch, even if we optimistically assume the best hyperparameter for this policy (not to mention that we hardly know what the best hyperparameter should be in practice). Our framework based on Bayesian decision theory offers a more principled approach to batch active search (especially batch-ENS), and our methods are effectively hyperparameter-free (except for the number of samples used in batch-ENS). In the following, we elaborate on the three observations.
Empirical adaptivity gap. Regardless of the policy used, performance in general degrades as the batch size increases. But how fast? We average the results in Table 1 over all policies for each batch size b as an empirical surrogate for OPTb in Theorem 1, and plot the resulting surrogate value of OPT1/OPTb as a function of b in Figure 1a. Although these policies are not optimal, the empirical performance gap matches our theoretical linear bound surprisingly well. Similar results for different budgets on a different dataset are shown in the supplemental material. These results could provide valuable guidance on choosing batch sizes.
Despite the overall trends in our results, we see some interesting exceptions. Notably, batch-ENS with batch size 5 is significantly better than with batch size 1, with a p-value of 0.02 under a one-sided paired t-test. This is counterintuitive given our analysis of the adaptivity gap. We conjecture that batch-ENS with larger batch sizes forces more (but not too much) exploration, potentially improving somewhat on sequential ENS in practice.
Why is the pessimistic oracle better? Among the four fictional oracles, the pessimistic one usually performs the best for sequential simulation.
When the pessimistic oracle is combined with a greedy policy, Proposition 1 offers some mathematical rationale: sequential simulation then near-optimally maximizes the probability of unit improvement, which is a reasonable criterion. Intuitively, by always assuming the previously added points to be negative, the probabilities of nearby points are lowered, offering a repulsive force that compels later points to be located elsewhere and leads to a more diverse batch. This mechanism could help better explore the search space.
We verify this hypothesis by quantifying the diversity of the chosen batches. Specifically, for each batch B and any pair xi, xj ∈ B, we compute the rank of xj according to increasing distance from xi, and average the ranks over all pairs as the diversity score of the batch. We use rank instead of distance for invariance to scale. We find that the diversity scores of the chosen batches align well with batch active search performance; details can be found in the supplemental material. Note that this coincides with the idea of explicitly using repulsion to create a diverse batch, which has been adopted in similar settings such as Bayesian optimization [10].
Myopic vs. nonmyopic behavior. Nonmyopic policies (ENS-based) almost always perform better than myopic policies. This matches our expectation, as nonmyopic policies are always cognizant of the budget and hence can better trade off exploration and exploitation [12]. To gain some insight into the nature of this myopic/nonmyopic behavior, in Figure 1b we plot the probabilities of the points chosen (at the iteration of being chosen) by the greedy and batch-ENS-32 policies for batch size 50 across the drug discovery datasets. Corresponding plots for other policies are shown in the supplemental material. First, in each batch, the trend for greedy is not surprising, since every batch represents the top-50 points ordered by probabilities.
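The pairwise-rank diversity score described above can be sketched in Python as follows. This is a minimal illustration under our own assumptions: we use Euclidean distance and rank each batch member against the full candidate pool; the paper's exact computation may differ.

```python
import numpy as np

def batch_diversity_score(X, batch):
    """Diversity score of a batch: for each ordered pair (i, j) of batch
    members, rank x_j among all pool points by increasing distance to x_i,
    then average the ranks over all pairs. Higher means more diverse.

    X: (n, d) array of all candidate points; batch: list of row indices.
    """
    X = np.asarray(X, dtype=float)
    ranks = []
    for i in batch:
        # Squared Euclidean distances from x_i to every pool point.
        d = np.sum((X - X[i]) ** 2, axis=1)
        order = np.argsort(d)               # pool indices, nearest first
        rank_of = np.empty(len(X), dtype=int)
        rank_of[order] = np.arange(len(X))  # rank 0 is x_i itself
        for j in batch:
            if j != i:
                ranks.append(rank_of[j])
    return float(np.mean(ranks))
```

Because only distance ranks enter the score, it is invariant to any monotone rescaling of the distances, matching the motivation given in the text; a tightly clustered batch of mutual nearest neighbors receives a low score, while a spread-out batch receives a high one.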
For batch-ENS, there is no such trend except in the last batch, where batch-ENS naturally degenerates to greedy behavior. Second, over the whole search process, greedy shows a decreasing trend, likely due to over-exploitation in early stages. Batch-ENS, on the other hand, shows an increasing trend. This could be partly due to more and more positives being found. More importantly, we believe this trend is in part a reflection of the nonmyopia of batch-ENS: in early stages, it tends to explore the search space, so low-probability points might be chosen. As the remaining budget diminishes, it becomes more exploitive; in particular, the last batch is purely exploitive.

6 Conclusion

We have presented the first study of batch active search, where the goal is to find as many positives as possible within a given labeling budget. We derived the Bayesian optimal policy for batch active search, and proved a lower bound, linear in batch size, on the performance gap between optimal sequential and batch policies. This bound was shown to match empirical results.
We then generalized a recently proposed efficient nonmyopic search (ENS) policy to the batch setting and proposed two approaches to approximately solving the batch version of ENS: sequential simulation with fictional labeling oracles and greedy set function maximization. We conducted comprehensive experiments on data from three application domains, evaluating all fourteen proposed policies. Results show that nonmyopic policies perform significantly better than myopic ones. By analyzing the results, we gained a deeper understanding of the nonmyopic behavior and found diversity to be an important consideration for batch policy design.
We believe our theoretical and empirical analyses constitute a valuable step toward more effective application of (batch) active search in various important domains such as drug discovery and materials science.

Acknowledgments

We would like to thank all the anonymous reviewers for their valuable feedback. SJ, GM, and RG were supported by the National Science Foundation (NSF) under award number IIA–1355406. GM was also supported by the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES). MA was supported by NSF under award number CNS–1560191. BM was supported by a Google Research Award and by NSF under awards CCF–1830711, CCF–1824303, and CCF–1733873.

References
[1] A. Asadpour, H. Nazerzadeh, and A. Saberi. Stochastic Submodular Maximization. In C. Papadimitriou and S. Zhang, editors, International Workshop on Internet and Network Economics (WINE 2008), volume 5385 of Lecture Notes in Computer Science, pages 477–489. Springer–Verlag, 2008.
[2] S. Chakraborty, V. Balasubramanian, and S. Panchanathan. Adaptive Batch Mode Active Learning. IEEE Transactions on Neural Networks and Learning Systems, 26(8):1747–1760, 2015.
[3] Y. Chen and A. Krause. Near-optimal Batch Mode Active Learning and Adaptive Submodular Optimization. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML 2013), volume 28 of Proceedings of Machine Learning Research, pages 160–168, 2013.
[4] N. V. Cuong, W. S. Lee, N. Ye, K. M. A. Chai, and H. L. Chieu. Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 1457–1465, 2013.
[5] T. Desautels, A. Krause, and J. W. Burdick.
Parallelizing Exploration-Exploitation Tradeoffs in Gaussian Process Bandit Optimization. Journal of Machine Learning Research, 15(Dec):4053–4103, 2014.
[6] R. Garnett, Y. Krishnamurthy, X. Xiong, J. G. Schneider, and R. P. Mann. Bayesian Optimal Active Search and Surveying. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.
[7] R. Garnett, T. Gärtner, M. Vogt, and J. Bajorath. Introducing the 'active search' method for iterative virtual screening. Journal of Computer-Aided Molecular Design, 29(4):305–314, 2015.
[8] D. Ginsbourger, R. Le Riche, and L. Carraro. A Multi-points Criterion for Deterministic Parallel Global Optimization based on Gaussian Processes. Technical report, 2008. hal-00260579.
[9] D. Ginsbourger, R. Le Riche, and L. Carraro. Kriging is Well-Suited to Parallelize Optimization. In Y. Tenne and C.-K. Goh, editors, Computational Intelligence in Expensive Optimization Problems, volume 2 of Adaptation Learning and Optimization, pages 131–162. Springer–Verlag, 2010.
[10] J. González, Z. Dai, P. Hennig, and N. Lawrence. Batch Bayesian Optimization via Local Penalization. In A. Gretton and C. C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016), volume 51 of Proceedings of Machine Learning Research, pages 648–657, 2016.
[11] S. C. Hoi, R. Jin, J. Zhu, and M. R. Lyu. Batch Mode Active Learning and Its Application to Medical Image Classification. In W. Cohen and A. Moore, editors, Proceedings of the 23rd International Conference on Machine Learning (ICML 2006), pages 417–424, 2006.
[12] S. Jiang, G. Malkomes, G. Converse, A. Shofner, B. Moseley, and R. Garnett. Efficient Nonmyopic Active Search. In D. Precup and Y. W.
Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pages 1714–1723, 2017.
[13] H. J. Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise. Journal of Basic Engineering, 86(1):97–106, 1964.
[14] Y. Ma, T.-K. Huang, and J. G. Schneider. Active Search and Bandits on Graphs using Sigma-Optimality. In M. Meila and T. Heskes, editors, Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI 2015), pages 542–551, 2015.
[15] Y. Ma, D. J. Sutherland, R. Garnett, and J. G. Schneider. Active Pointillistic Pattern Search. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS 2015), volume 38 of Proceedings of Machine Learning Research, pages 672–680, 2015.
[16] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An Analysis of Approximations for Maximizing Submodular Set Functions. Mathematical Programming, 14(1):265–294, 1978.
[17] D. Oglic, R. Garnett, and T. Gärtner. Active Search in Intensionally Specified Structured Spaces. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), pages 2443–2449, 2017.
[18] D. Oglic, S. A. Oatley, S. J. F. Macdonald, T. Mcinally, R. Garnett, J. D. Hirst, and T. Gärtner. Active Search for Computer-Aided Drug Design. Molecular Informatics, 37(1–2):1700130, 2018.
[19] N. Srinivas, A. Krause, S. Kakade, and M. W. Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In J. Fürnkranz and T. Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 1015–1022, 2010.
[20] T. Sterling and J. J. Irwin.
ZINC 15 \u2013 Ligand Discovery for Everyone. Journal of Chemical\n\nInformation and Modeling, 55(11):2324\u20132337, 2015.\n\n[21] D. J. Sutherland, B. P\u00f3czos, and J. Schneider. Active Learning and Search on Low-Rank\nMatrices. In R. Ghani, T. E. Senator, P. Bradley, R. Parekh, and J. He, editors, Proceedings of\nthe 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\n(KDD 2013), pages 212\u2013220, 2013.\n\n[22] H. P. Vanchinathan, A. Marfurt, C.-A. Robelin, D. Kossmann, and A. Krause. Discovering\nValuable Items from Massive Data. In Proceedings of the 21st ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining (KDD 2015), pages 1195\u20131204, 2015.\n[23] J. Wang. Bayesian Optimization with Parallel Function Evaluations and Multiple Information\nSources: Methodology with Applications in Biochemistry, Aerospace Engineering, and Machine\nLearning. PhD thesis, Operations Research and Information Engineering, Cornell University,\n2017.\n\n[24] X. Wang, R. Garnett, and J. G. Schneider. Active Search on Graphs. In R. Ghani, T. E. Senator,\nP. Bradley, R. Parekh, and J. He, editors, Proceedings of the 19th ACM SIGKDD International\nConference on Knowledge Discovery and Data Mining (KDD 2013), pages 731\u2013738, 2013.\n\n[25] M. K. Warmuth, G. R\u00e4tsch, M. Mathieson, J. Liao, and C. Lemmen. Active Learning in the\nDrug Discovery Process. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances\nin Neural Information Processing Systems 14 (NIPS 2001), pages 1449\u20131456, 2001.\n\n[26] M. K. Warmuth, J. Liao, G. R\u00e4tsch, M. Mathieson, S. Putta, and C. Lemmen. Active Learning\nwith Support Vector Machines in the Drug Discovery Process. Journal of Chemical Information\nand Computer Sciences, 43(2):667\u2013673, 2003.\n\n[27] K. Williams, E. Bilsland, A. Sparkes, W. Aubrey, M. Young, L. N. Soldatova, K. De Grave,\nJ. Ramon, M. de Clare, W. Sirawaraporn, et al. 
Cheaper faster drug development validated by the repositioning of drugs against neglected tropical diseases. Journal of the Royal Society Interface, 12(104):20141289, 2015.
[28] J. Wu and P. Frazier. The Parallel Knowledge Gradient Method for Batch Bayesian Optimization. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29 (NIPS 2016), pages 3126–3134, 2016.