{"title": "Active Learning in the Drug Discovery Process", "book": "Advances in Neural Information Processing Systems", "page_first": 1449, "page_last": 1456, "abstract": "", "full_text": "Active Learning\n\nin the Drug Discovery Process\n\nManfred K. Warmuth , Gunnar R\u00a8atsch \u0001, Michael Mathieson\u0002 ,\n\nJun Liao\u0003 , Christian Lemmen\u0004\n\n\u0003 Computer Science Dep., Univ. of Calif. at Santa Cruz\n\u0004 DuPont Pharmaceuticals,150 California St. San Francisco.\n\n\u0001 FHG FIRST, Kekul\u00b4estr. 7, Berlin, Germany\n\n\u0005 manfred,mathiesm,liaojun\n\nclemmen@biosolveit.de\n\n@cse.ucsc.edu, Gunnar.Raetsch@anu.edu.au,\n\nAbstract\n\nWe investigate the following data mining problem from Computational\nChemistry: From a large data set of compounds, \ufb01nd those that bind to\na target molecule in as few iterations of biological testing as possible. In\neach iteration a comparatively small batch of compounds is screened for\nbinding to the target. We apply active learning techniques for selecting\nthe successive batches.\nOne selection strategy picks unlabeled examples closest to the maximum\nmargin hyperplane. Another produces many weight vectors by running\nperceptrons over multiple permutations of the data. Each weight vector\nprediction and we pick the unlabeled examples for which\nvotes with its\nthe prediction is most evenly split between\n. For a third selec-\ntion strategy note that each unlabeled example bisects the version space\nof consistent weight vectors. 
We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space.\nWe demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.\n\n1 Introduction\n\nTwo of the most important goals in Computational Drug Design are to find active compounds in large databases quickly and (usually along the way) to obtain an interpretable model for what makes a specific subset of compounds active. Activity is typically defined as binding to a target molecule.\n\nAll but the last author received partial support from NSF grant CCR 9821087. Current address of the second author: Australian National University, Canberra, Australia; partially supported by DFG (JA 379/9-1, MU 987/1-1) and travel grants from EU (Neurocolt II). Current address of the last author: BioSolveIT GmbH, An der Ziegelei 75, Sankt Augustin, Germany.\n\nMost of the time an iterative approach to the problem is employed. That is, in each iteration a batch of unlabeled compounds is screened against the target using some sort of biological assay [MGST97]. The desired goal is that many active hits show up in the assays of the selected batches.\n\nFrom the Machine Learning point of view all examples (compounds) are initially unlabeled. In each iteration the learner selects a batch of unlabeled examples for being labeled as positive (active) or negative (inactive). In Machine Learning this type of problem has been called \u201cquery learning\u201d [Ang88], \u201cselective sampling\u201d [CAL90] or \u201cactive learning\u201d [TK00]. A Round0 data set contains 1,316 chemically diverse examples, only 39 of which are positive. A second Round1 data set has 634 examples with 150 positives. 
This data set is preselected on the basis of medicinal chemistry intuition. Note that our classification problem is fundamentally asymmetric in that the data sets have typically many more negative examples, and the Chemists are more interested in the positive hits because these compounds might lead to new drugs. What makes this problem challenging is that each compound is described by a vector of 139,351 binary shape features. The vectors are sparse (on the average 1378 features are set per Round0 compound and 7613 per Round1 compound).\n\nWe are working with retrospective data sets for which we know all the labels. However, we simulate the real-life situation by initially hiding all labels and only giving to the algorithm the labels for the requested batches of examples (virtual screening). The long-term goal of this type of research is to provide a computer program to the Chemists which will do the following interactive job: At any point new unlabeled examples may be added. Whenever a test is completed, the labels are given to the program. Whenever a new test needs to be set up, the Chemist asks the program to suggest a batch of unlabeled compounds. The suggested batch might be \u201cedited\u201d and augmented using the invaluable knowledge and intuition of the medicinal Chemist. The hope is that the computer assisted approach allows for mining larger data sets more quickly. Note that compounds are often generated with virtual Combinatorial Chemistry. Even though compound descriptors can be computed, the compounds have not been synthesized yet. In other words it is comparatively easy to generate lots of unlabeled data.\n\nFigure 1: Three types of compounds/points: active, inactive, and yet unlabeled. The Maximum Margin Hyperplane is used as the internal classifier.\n\nIn our case the Round0 data set consists of compounds from vendor catalogs and corporate collections. 
Much more design effort went into the harder Round1 data set. Our initial results are very encouraging. Our selection strategies do much better than choosing random batches, indicating that the long-term goal outlined above may be feasible.\n\nThus from the Machine Learning point of view we have a fixed set of points in R^n that are either unlabeled or labeled positive or negative (see Figure 1). The binary descriptors of the compounds are rather \u201ccomplete\u201d and the data is always linearly separable. Thus we concentrate on simple linear classifiers in this paper.2 We analyzed a large number of different ways to produce hyperplanes and combine hyperplanes. In the next section we describe different selection strategies on the basis of these hyperplanes in detail and provide an experimental comparison. Finally in Section 3 we give some theoretical justification for why the strategies are so effective.\n\n1 Data provided by DuPont Pharmaceuticals.\n2 On the current data sets kernels did not improve the results (not shown).\n\n2 Different Selection Criteria and their Performance\n\nA selection algorithm is specified in three parts: a batch size, an initialization and a selection strategy. In practice it is not cost effective to test single examples at a time. We always chose 5% of the total data set as our batch size, which matches reasonably with typical experimental constraints. The initial batches are chosen at random until at least one positive and one negative example are found. Typically this is achieved with the first batch. All further batches are chosen using the selection strategy.\n\nAs we mentioned in the introduction, all our selection strategies are based on linear classifiers of the data labeled so far. 
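As a concrete sketch of this setup, the toy code below (our own illustration; the helper names are not from the paper) trains a perceptron to consistency on the labeled examples, normalizes the weight vector to unit length, and picks the batch of unlabeled examples closest to the resulting hyperplane, in the spirit of the Perc strategy described next:

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Run a perceptron over the labeled data until it is
    consistent (the data here is linearly separable)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake: additive update
                w += yi * xi
                mistakes += 1
        if mistakes == 0:            # consistent pass: done
            break
    return w / np.linalg.norm(w)     # unit-length normal direction

def closest_batch(w, X_unlabeled, batch_size):
    """Select the unlabeled examples closest to the hyperplane w."""
    return np.argsort(np.abs(X_unlabeled @ w))[:batch_size]
```

Both the examples and the weight vector are kept at unit length, matching the normalization used throughout the paper.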
All examples are normalized to unit-length and we consider homogeneous hyperplanes {x : w \u00b7 x = 0}, where the normal direction w is again unit-length. A plane w predicts with sign(w \u00b7 x) on the example/compound x.\n\nOnce we specify how the weight vector is found, the next batch is found by selecting the unlabeled examples closest to this hyperplane. The simplest way to obtain such a weight vector is to run a perceptron over the labeled data until it produces a consistent weight vector (Perc). Our second selection strategy (called SVM) uses the maximum margin hyperplane [BGV92] produced by a Support Vector Machine. When using the perceptron to predict for example handwritten characters, it has been shown that \u201cvoting\u201d the predictions of many hyperplanes improves the predictive performance [FS98]. So we always start from the weight vector zero and do multiple passes over the data until the perceptron is consistent. After processing each example we store the weight vector. We remember all weight vectors for each pass3 and do this for 100 random permutations of the labeled examples. Each weight vector gets one vote. The prediction on an example is positive if the total vote is larger than zero and we select the unlabeled examples whose total vote is closest to zero4. We call this selection strategy VoPerc.\nThe dot product is commutative. So when the example x lies on the positive side of the hyperplane w, then in a dual view the point w lies on the positive side of the hyperplane x (recall all instances and weight vectors have unit-length). A weight vector w is consistent with the labeled examples if w lies on the y-side of the plane x for each example x with label y. The set of all consistent weight vectors is called the version space, which is a section of the unit hypersphere bounded by the planes corresponding to the labeled examples. An unlabeled hyperplane x bisects the version space. For our third selection strategy (VolEst) a billiard is bounced 1000 times inside the version space and the fraction f of bounce points on the positive side of x is computed. 
The prediction for x is positive if f is larger than half, and the strategy selects unlabeled points whose fraction f is closest to half.\n\nIn Figure 2 (left) we plot the true positives and false positives w.r.t. the whole data set for Perc and VoPerc, showing that VoPerc performs slightly better. Also VoPerc has lower variance (Figure 2 (right)). Figure 3 (left) shows the averaged true positives and false positives of VoPerc, SVM, and VolEst. We note that all three perform similarly. We also plotted ROC curves after each batch has been added (not shown). These plots also show that all three strategies are comparable.\n\nThe three strategies VoPerc, SVM, and VolEst all perform much better than the corresponding strategies where the selection criterion is to select random unlabeled examples instead of using a \u201cclosest\u201d criterion. For example we show in Figure 4 that SVM is significantly better than SVM-Rand. Surprisingly the improvement is larger on the easier Round0 data set. The reason is that Round0 has a smaller fraction of positive examples (3%). Recall\n\n3 Surprisingly, with some smart bookkeeping this can all be done with essentially no computational overhead 
[FS98].\n\n4 Instead of voting the predictions of all weight vectors one can also average all the weight vectors after normalizing them and select unlabeled examples closest to the resulting single weight vector. This way of averaging leads to slightly worse results (not shown).\n\nFigure 2: (left) Average (over 10 runs) of true positives and false positives on the entire Round1 data set after each 5% batch for Perc and VoPerc. (right) Standard deviation over 10 runs.\n\nFigure 3: (left) Average (over 10 runs) of true and false positives on entire Round1 data set after each 5% batch for VoPerc, SVM, and VolEst. (right) Comparison of 5% batch size and 1 example batch size for VoPerc on Round1 data.\n\nthat the Round1 data was preselected by the Chemists for actives and the fraction was raised to about 25%. 
This suggests that our methods are particularly suitable when few positive examples are hidden in a large set of negative examples.\n\nThe simple strategy SVM of choosing unlabeled examples closest to the maximum margin hyperplane has been investigated by other authors (in [CCS00] for character recognition and in [TK00] for text categorization). The labeled points that are closest to the hyperplane are called the support vectors because if all other points are removed then the maximum margin hyperplane remains unchanged. In Figure 5 we visualize the location of the points in relation to the center of the hyperplane. We show the location of the points projected onto the normal direction of the hyperplane. For each 5% batch the location of the points is scattered onto a thin stripe. The hyperplane crosses the stripe in the middle. In the left plot the distances are scaled so that the support vectors are at distance \u00b11. In the right plot the geometric distance to the hyperplane is plotted. Recall that we pick unlabeled points closest to the hyperplane (center of the stripe). As soon as the \u201cwindow\u201d between the support vectors is cleaned, most positive examples have been found (compare with the SVM curves given in Figure 3 (left)). Also shrinking the width of the geometric window corresponds to improved generalization.\n\nSo far our three selection strategies VoPerc, SVM and VolEst have shown similar performance. The question is whether the performance criterion considered so far is suitable for the drug design application. Here the goal is to label/verify many positive compounds quickly. We therefore think that the total number of positives (hits) among all examples tested so far is the best performance criterion. 
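This hit-count criterion is straightforward to compute; the small sketch below (our own illustration, not code from the paper) tallies the cumulative positives after each tested batch for a given selection order:

```python
import numpy as np

def cumulative_hits(labels, selection_order, batch_size):
    """Total number of positives (hits) found after each batch when
    the examples are tested in the given selection order."""
    ordered = np.asarray(labels)[np.asarray(selection_order)]
    per_batch = [ordered[i:i + batch_size].sum()
                 for i in range(0, len(ordered), batch_size)]
    return np.cumsum(per_batch)
```

For random selection the expected curve is a straight line: after b batches roughly b times the batch size times the fraction of positives, which is the linear baseline the strategies are compared against.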
Note that the total number of hits of the random selection strategy grows linearly with the number of batches (in each random batch we expect 5% hits).\n\nFigure 4: Comparisons of SVM using random batch selection and closest batch selection. (left) Round0 data. (right) Round1 data.\n\nFigure 5: (left) Scatter plot of the distance of examples to the maximum margin hyperplane normalized so support vectors are at \u00b11. (right) Scatter plot of the geometric distance of examples to the hyperplane. Each stripe shows location of a random sub-sample of points (Round1 data) after an additional 5% batch has been labeled by SVM. Selected examples are black x, unselected positives are red plus, unselected negatives are blue square. 
In contrast the total number of hits of VoPerc, SVM and VolEst is 5% in the first batch (since it is random) but grows much faster thereafter (see Figure 6). VoPerc performs the best.\n\nSince the positive examples are much more valuable in our application, we also changed the selection strategy SVM to select unlabeled examples of largest positive distance5 to the maximum margin hyperplane (SVM+) rather than smallest distance. Correspondingly VoPerc+ picks the unlabeled example with the highest vote and VolEst+ picks the unlabeled example with the largest fraction f. The total hit plots of the resulting modified strategies SVM+, VoPerc+ and VolEst+ are improved (see Figure 7 versus Figure 6). However the generalization plots of the modified strategies (i.e. curves like Figure 3 (left)) are slightly worse for the new versions. Thus in some sense the original strategies are better at \u201cexploration\u201d (giving better generalization on the entire data set) while the modified strategies are better at \u201cexploitation\u201d (higher number of total hits). We show this trade-off in Figure 8 for SVM and SVM+. The same trade-off occurs for the VoPerc and VolEst pairs of strategies (not shown).\n\nFinally we investigate the effect of batch size on performance. For simplicity we only show total hit plots for VoPerc (Figure 3 (right)). Note that for our data a batch size of 5% (31 examples for Round1) is performing not much worse than the experimentally unrealistic batch size of only 1 example. 
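The switch between the closest-to-hyperplane rule and the + (exploitation) variants amounts to a one-line change in the scoring. A minimal sketch, with our own helper name, assuming a unit-length weight vector and row-normalized unlabeled pool:

```python
import numpy as np

def select_batch(w, X_unlabeled, batch_size, exploit=False):
    """exploit=False: unlabeled examples closest to the hyperplane (SVM);
    exploit=True: examples of largest positive distance (SVM+)."""
    dist = X_unlabeled @ w                   # signed distance to hyperplane
    score = -dist if exploit else np.abs(dist)
    return np.argsort(score)[:batch_size]    # smallest score first
```

With exploit=True the batch is filled with the points the current model is most confident are active, trading generalization for immediate hits.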
Only when the results for batch size 1 are much better than the results for larger batch sizes are more sophisticated selection strategies worth exploring that pick, say, a batch that is \u201cclose\u201d and at the same time \u201cdiverse\u201d.\n\n5 In Figure 5 this means we are selecting from right to left.\n\nFigure 6: Total hit performance on Round0 (left) and Round1 (right) data of SVM, VoPerc and VolEst with 5% batch size.\n\nFigure 7: Total hit performance on Round0 (left) and Round1 (right) data of SVM+, VoPerc+ and VolEst+ with 5% batch size.\n\nAt this point our data sets are still small enough that we were able to precompute all dot products (the kernel matrix). After this preprocessing, one pass of a perceptron takes at most time proportional to the number of labeled examples times the number of mistakes. 
For the computation of VolEst we need to spend time linear in the number of labeled examples per bounce of the billiard, and finding the maximum margin hyperplane is estimated to be considerably more expensive. In our implementations we used SVM Light [Joa99] and the billiard algorithm of [Ruj97, RM00, HGC99].\n\nIf we have the internal hypothesis of the algorithm, then for applying the selection criterion we need to evaluate the hypothesis for each unlabeled point. This cost is proportional to the number of support vectors for the SVM-based methods and proportional to the number of mistakes for the perceptron-based methods. In the case of VolEst we again need time linear in the number of labeled points per bounce.\n\nOverall VolEst was clearly the slowest. For much larger data sets VoPerc seems to be the simplest and the most adaptable.\n\n3 Theoretical Justifications\n\nAs we see in Figure 5 (right), the geometric margin of the support vectors (half the width of the window) is shrinking as more examples are labeled. Thus the following goal is reasonable for designing selection strategies: pick unlabeled examples that cause the margin to shrink the most. 
The simplest such strategy is to pick examples closest to the maximum margin hyperplane, since these examples are expected to change the maximum margin hyperplane the most [TK00, CCS00].\n\nFigure 8: Exploitation versus Exploration: (left) Total hit performance and (right) true and false positives performance of SVM and SVM+ on Round1 data.\n\nAn alternative goal is to reduce the volume of the version space. This volume is a rough measure of the remaining uncertainty in the data. Recall that both the weight vectors and the instances have unit length. Thus w \u00b7 x is the distance of the point x to the plane w as well as (in the dual view) the distance of the point w to the plane x. The maximum margin hyperplane w is the point in version space with the largest sphere that is completely contained in the version space [Ruj97, RM00]. After labeling x only one side of the plane remains. So if x passes close to the point w then about half of the largest sphere is eliminated from the version space. So this is a second justification for selecting unlabeled examples closest to the maximum margin hyperplane.\n\nOur selection strategy VolEst starts from any point inside the version space and then bounces a billiard 1000 times. 
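The billiard machinery is involved; as a rough stand-in that conveys the same idea in a toy dimension, one can estimate the split fractions by rejection-sampling unit weight vectors (our simplification, not the algorithm used in the paper):

```python
import numpy as np

def split_fractions(X_lab, y_lab, X_unlab, n_samples=10000, seed=0):
    """For each unlabeled example x, estimate the fraction of the
    version space lying on the positive side of the hyperplane x, by
    sampling random unit weight vectors and keeping those consistent
    with the labeled data (rejection sampling; the paper instead
    bounces a billiard inside the version space)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_samples, X_lab.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    consistent = np.all((W @ X_lab.T) * y_lab > 0, axis=1)
    V = W[consistent]                 # samples from the version space
    return (V @ X_unlab.T > 0).mean(axis=0)
```

VolEst then selects the unlabeled points whose estimated fraction is closest to one half. Rejection sampling is only viable in very low dimension; the billiard walk is what makes the estimate feasible in high-dimensional version spaces.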
The billiard is almost always ergodic (see discussion in [Ruj97]). Thus the fraction f of bounces on the positive side of an unlabeled hyperplane x is an estimate of the fraction of volume on the positive side of x. Since it is unknown how x will be labeled, the best examples are those that split the version space in half. Thus in VolEst we select unlabeled points for which f is closest to half. The thinking underlying our strategy VolEst is most closely related to the Committee Machine, where random concepts in the version space are asked to vote on the next random example and the label of that example is requested only if the vote is close to an even split [SOS92].\n\nWe tried to improve our estimate of the volume by replacing f by the fraction of the total trajectory located on the positive side of x. On our two data sets this did not improve the performance (not shown). We also averaged the 1000 bounce points. The resulting weight vector (an approximation to the center of mass of the version space) approximates the so called Bayes point [Ruj97], which has the following property: any unlabeled hyperplane passing through the Bayes point cuts the version space roughly6 in half. We thus tested a selection strategy which picks unlabeled points closest to the estimated center of mass. This strategy was again indistinguishable from the other two strategies based on bouncing the billiard.\nWe have no rigorous justification for the + variants of our algorithms.\n\n4 Conclusion\n\nWe showed how the active learning paradigm ideally fits the drug design cycle. After some deliberations we concluded that the total number of positive examples (hits) among the tested examples is the best performance criterion for the drug design application. 
We found that a number of different selection strategies have comparable performance. The variants that select the unlabeled examples with the highest score (i.e. the + variants) perform better. Overall the selection strategies based on the Voted Perceptron were the most versatile and showed slightly better performance.\n\n6 Even in dimension two there is no point that does this exactly [Ruj97].\n\nReferences\n\n[Ang88] D. Angluin. Queries and concept learning. Machine Learning, 2:319\u2013342, 1988.\n\n[BGV92] B.E. Boser, I.M. Guyon, and V.N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144\u2013152, 1992.\n\n[CAL90] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries and selective sampling. Advances in Neural Information Processing Systems, 2:566\u2013573, 1990.\n\n[CCS00] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of ICML2000, page 8, Stanford, CA, 2000.\n\n[FS98] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. In Proc. 11th Annu. Conf. on Comput. Learning Theory. ACM Press, New York, NY, July 1998.\n\n[HGC99] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines: Estimating the Bayes point in kernel space. In Proceedings of IJCAI Workshop Support Vector Machines, pages 23\u201327, 1999.\n\n[Joa99] T. Joachims. Making large-scale SVM learning practical. In B. Sch\u00f6lkopf, C.J.C. Burges, and A.J. Smola, editors, Advances in Kernel Methods \u2014 Support Vector Learning, pages 169\u2013184, Cambridge, MA, 1999. MIT Press.\n\n[MGST97] P. Myers, J. Greene, J. Saunders, and S. Teig. Rapid, reliable drug discovery. Today's Chemist at Work, 6:46\u201353, 1997.\n\n[RM00] P. Ruj\u00e1n and M. Marchand. 
Computing the Bayes kernel classifier. In Advances in Large Margin Classifiers, volume 12, pages 329\u2013348. MIT Press, 2000.\n\n[Ruj97] P. Ruj\u00e1n. Playing billiard in version space. Neural Computation, 9:99\u2013122, 1997.\n\n[SOS92] H. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Workshop on Computational Learning Theory, pages 287\u2013294, 1992.\n\n[TK00] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, CA, 2000. Morgan Kaufmann.\n", "award": [], "sourceid": 2097, "authors": [{"given_name": "Manfred K.", "family_name": "Warmuth", "institution": null}, {"given_name": "Gunnar", "family_name": "R\u00e4tsch", "institution": null}, {"given_name": "Michael", "family_name": "Mathieson", "institution": null}, {"given_name": "Jun", "family_name": "Liao", "institution": null}, {"given_name": "Christian", "family_name": "Lemmen", "institution": null}]}