{"title": "Learning Active Learning from Data", "book": "Advances in Neural Information Processing Systems", "page_first": 4225, "page_last": 4235, "abstract": "In this paper, we suggest a novel data-driven approach to active learning (AL). The key idea is to train a regressor that predicts the expected error reduction for a candidate sample in a particular learning state. By formulating the query selection procedure as a regression problem we are not restricted to working with existing AL heuristics; instead, we learn strategies based on experience from previous AL outcomes. We show that a strategy can be learnt either from simple synthetic 2D datasets or from a subset of domain-specific data. Our method yields strategies that work well on real data from a wide range of domains.", "full_text": "Learning Active Learning from Data\n\nKsenia Konyushkova\u21e4\n\nCVLab, EPFL\n\nLausanne, Switzerland\n\nSznitman Raphael\n\nARTORG Center, University of Bern\n\nksenia.konyushkova@epfl.ch\n\nraphael.sznitman@artorg.unibe.ch\n\nBern, Switzerland\n\nPascal Fua\nCVLab, EPFL\n\nLausanne, Switzerland\npascal.fua@epfl.ch\n\nAbstract\n\nIn this paper, we suggest a novel data-driven approach to active learning (AL).\nThe key idea is to train a regressor that predicts the expected error reduction for a\ncandidate sample in a particular learning state. By formulating the query selection\nprocedure as a regression problem we are not restricted to working with existing\nAL heuristics; instead, we learn strategies based on experience from previous AL\noutcomes. We show that a strategy can be learnt either from simple synthetic 2D\ndatasets or from a subset of domain-speci\ufb01c data. Our method yields strategies that\nwork well on real data from a wide range of domains.\n\n1\n\nIntroduction\n\nMany modern machine learning techniques require large amounts of training data to reach their full\npotential. 
However, annotated data is hard and expensive to obtain, notably in specialized domains where only experts, whose time is scarce and precious, can provide reliable labels. Active learning (AL) aims to ease the data collection process by automatically deciding which instances an annotator should label to train an algorithm as quickly and effectively as possible.

Over the years many AL strategies have been developed for various classification tasks, without any one of them clearly outperforming the others in all cases. Consequently, a number of meta-AL approaches have been proposed to automatically select the best strategy. Recent examples include bandit algorithms [2, 11, 3] and reinforcement learning approaches [5]. A common limitation of these methods is that they cannot go beyond combining pre-existing hand-designed heuristics. Besides, they require a reliable assessment of the classification performance, which is problematic because the annotated data is scarce. In this paper, we overcome these limitations thanks to two features of our approach. First, we look at a whole continuum of AL strategies instead of combinations of pre-specified heuristics. Second, we bypass the need to evaluate the classification quality from application-specific data because we rely on experience from previous tasks and can seamlessly transfer strategies to new domains.

More specifically, we formulate Learning Active Learning (LAL) as a regression problem. Given a trained classifier and its output for a specific sample without a label, we predict the reduction in generalization error that can be expected by adding the label to that datapoint.
In practice, we show that we can train this regression function on synthetic data by using simple features, such as the variance of the classifier output or the predicted probability distribution over possible labels for a specific datapoint. The features for the regression are not domain-specific, and this enables us to apply the regressor trained on synthetic data directly to other classification problems. Furthermore, if a sufficiently large annotated set can be provided initially, the regressor can be trained on it instead of on synthetic data. The resulting AL strategy is then tailored to the particular problem at hand. We show that LAL works well on real data from several different domains such as biomedical imaging, economics, molecular biology and high energy physics. This query selection strategy outperforms competing methods without requiring hand-crafted heuristics and at a comparatively low computational cost.

*http://ksenia.konyushkova.com

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Related work

The extensive development of AL in the last decade has resulted in various strategies. They include uncertainty sampling [32, 15, 27, 34], query-by-committee [7, 13], expected model change [27, 30, 33], expected error or variance minimization [14, 9] and information gain [10]. Among these, uncertainty sampling is both simple and computationally efficient. This makes it one of the most popular strategies in real applications. In short, it suggests labeling the samples that are the most uncertain, i.e., closest to the classifier's decision boundary. The above methods work very well in cases such as the ones depicted in the top row of Fig. 2, but often fail in the more difficult ones depicted in the bottom row [2].

Among AL methods, some cater to specific classifiers, such as those relying on Gaussian processes [16], or to specific applications, such as natural language processing [32, 25], sequence labeling tasks [28], visual recognition [21, 18], semantic segmentation [33], foreground-background segmentation [17], and preference learning [29, 22]. Moreover, various query strategies aim to maximize different performance metrics, as evidenced in the case of multi-class classification [27]. However, there is no one algorithm that consistently outperforms all others in all applications [28].

Meta-learning algorithms have been gaining in popularity in recent years [31, 26], but few of them tackle the problem of learning AL strategies. Baram et al. [2] combine several known heuristics with the help of a bandit algorithm. This is made possible by the maximum entropy criterion, which estimates the classification performance without labels. Hsu et al. [11] improve on it by moving the focus from data samples as arms to heuristics as arms in the bandit and use a new unbiased estimator of the test error. Chu and Lin [3] go further and transfer the bandit-learnt combination of AL heuristics between different tasks. Another approach is introduced by Ebert et al. [5]. It involves balancing exploration and exploitation in the choice of samples with a Markov decision process.

The two main limitations of these approaches are as follows. First, they are restricted to combining already existing techniques and, second, their success depends on the ability to estimate the classification performance from scarce annotated data. The data-driven nature of LAL helps to overcome these limitations. Sec. 5 shows that it outperforms several baselines including those of Hsu et al. [11] and Kapoor et al. [16].

3 Towards data-driven active learning

In this section we briefly introduce the active learning framework along with uncertainty sampling (US), the most frequently-used AL heuristic. Then, we motivate why a data-driven approach can improve AL strategies and how it can deal with the situations where US fails. We select US as a representative method because it is popular and widely applicable; however, the behavior that we describe is typical for a wide range of AL strategies.

3.1 Active learning (AL)

Given a machine learning model and a pool of unlabeled data, the goal of AL is to select which data should be annotated in order to learn the model as quickly as possible. In practice, this means that instead of asking experts to annotate all the data, we select iteratively and adaptively which datapoints should be annotated next. In this paper we are interested in classifying datapoints from a target dataset Z = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional feature vector and y_i ∈ {0, 1} is its binary label. We choose a probabilistic classifier f that can be trained on some L_t ⊂ Z to map features to labels, f_t(x_i) = ŷ_i, through the predicted probability p_t(y_i = y | x_i). The standard AL procedure unfolds as follows.

1. The algorithm starts with a small labeled training dataset L_t ⊂ Z and a large pool of unannotated data U_t = Z \ L_t with t = 0.
2. A classifier f_t is trained using L_t.
3. A query selection procedure picks an instance x* ∈ U_t to be annotated at the next iteration.
4. x* is given a label y* by an oracle. The labeled and unlabeled sets are updated.
5. t is incremented, and steps 2–5 iterate until the desired accuracy is achieved or the number of iterations has reached a predefined limit.

Uncertainty sampling (US) US has been reported to be successful in numerous scenarios and settings and, despite its simplicity, it often works remarkably well [32, 15, 27, 34, 17, 24]. It focuses its selection on samples which the current classifier is the least certain about. There are several definitions of maximum uncertainty, but one of the most widely used is to select a sample x* that maximizes the entropy H over the probability of predicted classes:

x* = argmax_{x_i ∈ U_t} H[p_t(y_i = y | x_i)].   (1)

3.2 Success, failure, and motivation

We now motivate the need for LAL by presenting two toy examples. In the first one, US is empirically observed to be the best greedy approach, but in the second it makes suboptimal decisions. Let us consider simple two-dimensional datasets Z and Z′ drawn from the same distribution with an equal number of points in each class (Fig. 1, left). The data in each class comes from a Gaussian distribution with a different mean and the same isotropic covariance. We can initialize the AL procedure of Sec. 3.1 with one sample from each class and its respective label: L_0 = {(x_1, 0), (x_2, 1)} ⊂ Z and U_0 = Z \ L_0. Here we train a simple logistic regression classifier f on L_0 and then test it on Z′. If |Z′| is large, the test error can be considered a good approximation of the generalization error: ℓ_0 = Σ_{(x′,y′) ∈ Z′} ℓ(ŷ, y′), where ŷ = f_0(x′).

Let us try to label every point x from U_0 one by one, form a new labeled set L_x = L_0 ∪ {(x, y)} and check what error a new classifier f_x yields on Z′, that is, ℓ_x = Σ_{(x′,y′) ∈ Z′} ℓ(ŷ, y′), where ŷ = f_x(x′).
The difference between the errors obtained with the classifiers constructed on L_0 and L_x indicates how much the addition of a new datapoint x reduces the generalization error: δ_x = ℓ_0 − ℓ_x. We plot δ_x for the 0/1 loss function, averaged over 10 000 experiments, as a function of the predicted probability p_0 (Fig. 1, left). By design, US would select a datapoint with probability of class 0 close to 0.5. We observe that in this experiment, the datasample with p_0 closest to 0.5 is indeed the one that yields the greatest error reduction.

Figure 1: Balanced vs unbalanced. Left: two Gaussian clouds of the same size. Right: two Gaussian clouds with class 0 twice bigger than class 1. The test error reduction as a function of the predicted probability of class 0 in the respective datasets.

In the next experiment, class 0 contains twice as many datapoints as the other class, see Fig. 1 (right). As before, we plot the average error reduction as a function of p_0. We observe this time that the value of p_0 that corresponds to the largest expected error reduction is different from 0.5, and thus the choice of US becomes suboptimal. Also, the reduction in error is no longer symmetric for the two classes. The more imbalanced the two classes are, the further from the optimum the choice made by US is. In a complex realistic scenario, there are many other factors, such as label noise, outliers and the shape of the distribution, that further compound the problem.

Although query selection procedures can take into account statistical properties of the datasets and classifier, there is no simple way to foresee the influence of all possible factors. Thus, in this paper, we suggest Learning Active Learning (LAL). It uses properties of classifiers and data to predict the potential error reduction.
We tackle the query selection problem by using a regression model; this perspective enables us to construct new AL strategies in a flexible way. For instance, in the example of Fig. 1 (right) we expect LAL to learn a model that automatically adapts its selection to the relative prevalence of the two classes without having to explicitly state such a rule. Moreover, having learnt the error reduction prediction function, we can seamlessly transfer the LAL strategy to other domains with very little annotated data.

4 Monte-Carlo LAL

Our approach to AL is data-driven and can be formulated as a regression problem. Given a representative dataset with ground truth, we simulate an online learning procedure using a Monte-Carlo technique. We propose two versions of AL strategies that differ in how the datasets for learning a regressor are constructed. When building the first one, LALINDEPENDENT, we incorporate unused labels individually and at random to retrain the classifier. Our goal is to correlate the change in test performance with the properties of the classifier and of the newly added datapoint. To build the LALITERATIVE strategy, we further extend our method with a sequential procedure that accounts for the selection bias caused by AL. We formalize our LAL procedures in the remainder of the section.

4.1 Independent LAL

Let the representative dataset² consist of a training set D and a testing set D′. Let f be a classifier with a given training procedure. We start collecting data for the regressor by splitting D into a labeled set L_τ of size τ and an unlabeled set U_τ containing the remaining points (Alg. 1 DATAMONTECARLO). We then train a classifier f on L_τ, resulting in a function f_τ that we use to predict class labels for elements x′ from the test set D′ and estimate the test classification loss ℓ_τ.
We characterize the classifier state by K parameters φ_τ = {φ_τ^1, ..., φ_τ^K}, which are specific to the particular classifier type and are sensitive to changes in the training set while being relatively invariant to the stochasticity of the optimization procedure. For example, they can be the parameters of the kernel function if f is kernel-based, the average depth of the trees if f is a tree-based method, or the prediction variability if f is an ensemble classifier. The above steps are summarized in lines 3–5 of Alg. 1.

Algorithm 1 DATAMONTECARLO
1: Input: training set D and test set D′, classification procedure f, partitioning function SPLIT, size τ
2: Initialize: L_τ, U_τ ← SPLIT(D, τ)
3: train a classifier f_τ
4: estimate the test set loss ℓ_τ
5: compute the classification state parameters {φ_τ^1, ..., φ_τ^K}
6: for m = 1 to M do
7:   select x ∈ U_τ at random
8:   form a new labeled dataset L_x ← L_τ ∪ {x}
9:   compute the datapoint parameters {ψ_x^1, ..., ψ_x^R}
10:  train a classifier f_x
11:  estimate the new test loss ℓ_x
12:  compute the loss reduction δ_x ← ℓ_τ − ℓ_x
13:  ξ_m ← [φ_τ^1 ··· φ_τ^K  ψ_x^1 ··· ψ_x^R], δ_m ← δ_x
14: Ξ ← {ξ_m}, Δ ← {δ_m}, 1 ≤ m ≤ M
15: Return: matrix of learning states Ξ ∈ R^{M×(K+R)}, vector of reductions in error Δ ∈ R^M

²The representative dataset is an annotated dataset that does not need to come from the domain of interest. In Sec. 5 we show that a simple synthetic dataset is sufficient for learning strategies that can be applied to various real tasks across various domains.

Algorithm 2 BUILDLALINDEPENDENT
1: Input: iteration range {τ_min, ..., τ_max}, classification procedure f
2: SPLIT ← random partitioning function
3: Initialize: generate train set D and test dataset D′
4: for τ in {τ_min, ..., τ_max} do
5:   for q = 1 to Q do
6:     Ξ_τq, Δ_τq ← DATAMONTECARLO(D, D′, f, SPLIT, τ)
7: Ξ, Δ ← {Ξ_τq}, {Δ_τq}
8: train a regressor g : ξ ↦ δ on data Ξ, Δ
9: construct LALINDEPENDENT A(g): x* = argmax_{x ∈ U_t} g(ξ_{t,x})
10: Return: LALINDEPENDENT

Algorithm 3 BUILDLALITERATIVE
1: Input: iteration range {τ_min, ..., τ_max}, classification procedure f
2: SPLIT ← random partitioning function
3: Initialize: generate train set D and test dataset D′
4: for τ in {τ_min, ..., τ_max} do
5:   for q = 1 to Q do
6:     Ξ_τq, Δ_τq ← DATAMONTECARLO(D, D′, f, SPLIT, τ)
7:   Ξ_τ, Δ_τ ← {Ξ_τq, Δ_τq}
8:   train a regressor g_τ : ξ ↦ δ on Ξ_τ, Δ_τ
9:   SPLIT ← A(g_τ)
10: Ξ, Δ ← {Ξ_τ, Δ_τ}
11: train a regressor g : ξ ↦ δ on Ξ, Δ
12: construct LALITERATIVE A(g)
13: Return: LALITERATIVE

Next, we randomly select a new datapoint x from U_τ, which is characterized by R parameters ψ_x = {ψ_x^1, ..., ψ_x^R}. For example, they can include the predicted probability to belong to class y, the distance to the closest point in the dataset or the distance to the closest labeled point, but they do not include the features of x. We form a new labeled set L_x = L_τ ∪ {x} and retrain f (lines 7–13 of Alg. 1). The new classifier f_x results in the test-set loss ℓ_x.
Finally, we record the difference between the previous and new loss, δ_x = ℓ_τ − ℓ_x, which is associated with the learning state in which it was obtained. The learning state is characterized by a vector ξ_x = [φ_τ^1 ··· φ_τ^K  ψ_x^1 ··· ψ_x^R]ᵀ ∈ R^{K+R}, whose elements depend both on the state of the current classifier f_τ and on the datapoint x. To build an AL strategy LALINDEPENDENT we repeat the DATAMONTECARLO procedure for Q different initializations L_τ^1, L_τ^2, ..., L_τ^Q and T various labeled subset sizes τ = 2, ..., T + 1 (Alg. 2, lines 4 and 5). For each initialization q and iteration τ, we sample M different datapoints x, each of which yields a classifier/datapoint state pair with an associated reduction in error (Alg. 1, line 13). This results in a matrix Ξ ∈ R^{(QMT)×(K+R)} of observations ξ and a vector Δ ∈ R^{QMT} of labels δ (Alg. 2, line 7).

Our insight is that the observations ξ should lie on a smooth manifold and that similar states of the classifier result in similar behaviors when annotating similar samples. From this, a regression function can predict the potential error reduction of annotating a specific sample in a given classifier state. Line 8 of the BUILDLALINDEPENDENT algorithm looks for a mapping g : ξ ↦ δ. This mapping is not specific to the dataset D, and thus can be used to detect samples that promise the greatest increase in classifier performance in other target domains Z. The resulting LALINDEPENDENT strategy greedily selects a datapoint with the highest potential error reduction at iteration t by taking the maximum of the value predicted by the regressor g:

x* = argmax_{x ∈ U_t} g(ξ_{t,x}).   (2)

4.2 Iterative LAL

For any AL strategy at iteration t > 0, the labeled set L_t consists of samples selected at previous iterations, which is clearly not random. However, in Sec. 4.1 the dataset D is split into L_τ and U_τ randomly, no matter how many labeled samples τ are available.

To account for this, we modify the approach of Sec. 4.1 in Alg. 3 BUILDLALITERATIVE. Instead of partitioning the dataset D into L_τ and U_τ randomly, we suggest simulating the AL procedure, which selects datapoints according to the strategy learnt on the previously collected data (Alg. 3, line 9). It first learns a strategy A(g_2) based on a regression function g_2 which selects the most promising 3rd datapoint when 2 random points are available. In the next iteration, it learns a strategy A(g_3) that selects the 4th datapoint given 2 random points and 1 selected by A(g_2), etc. In this way, samples at each iteration depend on the samples at the previous iterations, and the sampling bias of AL is represented in the data Ξ, from which the final strategy LALITERATIVE is learnt.

The resulting strategies LALINDEPENDENT and LALITERATIVE are both reasonably fast during the online steps of AL: they just require evaluating the RF regressor. The offline part, generating a dataset to learn the regression function, can induce a significant computational cost depending on the parameters of the algorithm. For this reason, LALINDEPENDENT is preferred to LALITERATIVE when an application-specific strategy is needed.

5 Experiments

Implementation details We test AL strategies in two possible settings: a) cold start, where we start with one sample from each of two classes, and b) warm start, where a larger dataset of size N_0 ≪ N is available to train the initial classifier. In cold start we take the representative dataset to be a 2D synthetic dataset where the class-conditional data distributions are Gaussian, and we use the same LAL regressor in all 7 classification tasks.
While we mostly concentrate on the cold start scenario, we look at a few examples of warm start because we believe that it is largely overlooked in the literature even though it has significant practical interest. Learning a classifier for a real-life application with AL rarely starts from scratch; instead, a small initial annotated set is provided to understand whether a learning-based approach is applicable at all. While a small set is good for providing an initial insight, a real working prototype still requires much more training data. In this situation, we can benefit from the available training data to learn a specialized AL strategy for the application.

In most of the experiments, we use Random Forest (RF) classifiers for f and an RF regressor for g. The state of the learning process ξ_t at time t consists of the following features: a) predicted probability p(y = 0 | L_t, x); b) proportion of class 0 in L_t; c) out-of-bag cross-validated accuracy of f_t; d) variance of the feature importances of f_t; e) forest variance, computed as the variance of the trees' predictions on U_t; f) average tree depth of the forest; g) size of L_t. For additional implementation details, including examples of the synthetic datasets, parameters of the data generation algorithm and features in the case of GP classification, we refer the reader to the supplementary material. The code is made available at https://github.com/ksenia-konyushkova/LAL.

Baselines and protocol We consider three versions of our approach: a) LAL-independent-2D, the LALINDEPENDENT strategy trained on a synthetic dataset in cold start; b) LAL-iterative-2D, the LALITERATIVE strategy trained on a synthetic dataset in cold start; c) LAL-independent-WS, the LALINDEPENDENT strategy trained on warm start representative data. We compare them against the following 4 baselines: a) Rs, random sampling; b) Us, uncertainty sampling; c) Kapoor [16], an algorithm that balances exploration and exploitation by incorporating mean and variance estimation of the GP classifier; d) ALBE [11], a recent example of meta-AL that adaptively uses a combination of strategies, including Us, Rs and that of Huang et al. [12] (a strategy that uses the topology of the feature space in the query selection). The method of Hsu et al. [11] is chosen as our main baseline because it is a recent example of meta-AL and is known to outperform several benchmarks.

In all AL experiments we select samples from a training set and report the classification performance on an independent test set. We repeat each experiment 50–100 times with random permutations of training and testing splits and different initializations. We then report the average test performance as a function of the number of labeled samples. The performance metrics are task-specific and include classification accuracy, IOU [6], dice score [8], AMS score [1], as well as area under the ROC curve (AUC).

5.1 Synthetic data

Two-Gaussian-clouds experiments In this setting we test our approach with two classifiers: RF and a Gaussian Process classifier (GPC). Due to the computational cost of GPC, it is only tested in this experiment. We generate 100 new unseen synthetic datasets of the form shown in the top row of Fig. 2 and use them for testing AL strategies. In both cases the proposed LAL strategies select datapoints that help to construct better classifiers faster than Rs, Us, Kapoor and ALBE.

XOR-like experiments XOR-like datasets are known to be challenging for many machine learning methods and AL is no exception. It was reported in Baram et al.
[2] that various AL algorithms struggle with tasks such as those depicted in the bottom row of Fig. 2, namely Checkerboard 2×2 and Checkerboard 4×4. Additionally, we consider the Rotated Checkerboard 2×2 dataset (Fig. 2, bottom row, right). The task for RF becomes more difficult in this case because the discriminating features are no longer aligned with the axes. As previously observed [2], Us loses to Rs in these cases. ALBE does not suffer from such adversarial conditions as much as Us, but LAL-iterative-2D outperforms it on all XOR-like datasets.

Figure 2: Experiments on the synthetic data. Top row: RF and GP on 2 Gaussian clouds. Bottom row, from left to right: experiments on the Checkerboard 2×2, Checkerboard 4×4, and Rotated Checkerboard 2×2 datasets.

5.2 Real data

We now turn to real data from domains where annotating is hard because it requires special training to do correctly:

Striatum, a 3D Electron Microscopy stack of rat neural tissue; the task is to detect and segment mitochondria [20, 17];
MRI, brain scans obtained from the BRATS competition [23]; the task is to segment brain tumors in T1, T2, FLAIR, and post-Gadolinium T1 MR images;
Credit card [4], a dataset of credit card transactions made in 2013 by European cardholders; the task is to detect fraudulent transactions;
Splice, a molecular biology dataset with the task of detecting splice junctions in DNA sequences [19];
Higgs, a high energy physics dataset that contains measurements simulating the ATLAS experiment [1]; the task is to detect the Higgs boson in the noise signal.

Additional details about the above datasets, including sizes, dimensionalities and preprocessing techniques, can be found in the supplementary materials.

Cold Start AL The top row of Fig. 3 depicts the results of applying Rs, Us, LAL-independent-2D, and LAL-iterative-2D on the Striatum, MRI, and Credit card datasets. Both LAL strategies outperform Us, with LAL-iterative-2D being the better of the two. The best score of Us in these complex real-life tasks is reached 2.2–5 times faster by LAL-iterative-2D. Considering that the LAL regressor was learned using a simple synthetic 2D dataset, it is remarkable that it works effectively on such complex and high-dimensional tasks. Due to the high computational cost of ALBE, we downsample the Striatum and MRI datasets to 2000 datapoints (referred to as Striatum mini and MRI mini). Downsampling was not possible for the Credit card dataset due to the sparsity of positive labels (0.17%). We see in the bottom row of Fig. 3 that ALBE performs worse than Us but better than Rs. We ascribe this to the lack of labeled data, which ALBE needs to estimate the classification accuracy (see Sec. 2).

Figure 3: Experiments on real data. Top row: IOU for Striatum, dice score for MRI and AUC for Credit card as a function of the number of labeled points. Bottom row: comparison with ALBE on the Striatum mini and MRI mini datasets.

Warm Start AL In Fig. 4 we compare LAL-independent-WS on the Splice and Higgs datasets by initializing BUILDLALINDEPENDENT with 100 and 200 datapoints from the corresponding tasks. Notice that this is the only experiment where a significant amount of labelled data in the domain of interest is available prior to AL. We tested ALBE on the Splice dataset; however, in the Higgs dataset the number of iterations in the experiment is too large. LAL-independent-WS outperforms the other methods, with ALBE delivering competitive performance only after many AL iterations, and at a high computational cost.

Figure 4: Experiments on the real datasets in the warm start scenario. Accuracy for Splice is on the left, AMS score for Higgs is on the right.

5.3 Analysis of LAL strategies and time comparison

To better understand LAL strategies, we show in Fig.
5 (left) the relative importance of the features of the regressor g for LALITERATIVE. We observe that both the classifier state parameters and the datapoint parameters influence the AL selection, giving evidence that both of them are important for selecting a point to label. In order to understand what kind of selection LALINDEPENDENT and LALITERATIVE perform, we record the predicted probability of the chosen datapoint, p(y* = 0 | D_t, x*), in 10 cold start experiments with the same initialization on the MRI dataset. Fig. 5 (right) shows the histograms of these probabilities for Us, LAL-independent-2D and LAL-iterative-2D. The LAL strategies have high variance and modes different from 0.5. Not only does the selection by LAL strategies differ significantly from standard Us, but the independent and iterative approaches also differ from each other.

Figure 5: Left: feature importances of the RF regressor representing the LALITERATIVE strategy. Right: histograms of the selected probability for different AL strategies in experiments with the MRI dataset.

Computational costs While collecting synthetic data can be slow, it must only be done once, offline, for all applications. Besides, Alg. 1, 2 and 3 can be trivially parallelised thanks to a number of independent loops. Collecting application-specific data offline for warm start took us approximately 2.7 h and 1.9 h for the Higgs and Splice datasets, respectively. By contrast, the online user-interaction part is fast: it simply consists of learning f_t, extracting the learning state parameters and evaluating the regressor g.
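The cost of this online step can be measured with a minimal timing harness (an illustrative sketch; `fit` and `select` are placeholders for whatever classifier and query strategy are being timed):

```python
import time

def time_one_iteration(fit, select, labeled, pool, repeats=3):
    # wall-clock cost of one online AL step: retrain f_t, then score the pool
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        model = fit(labeled)   # train f_t on the current labeled set
        select(model, pool)    # e.g. evaluate the regressor g on U_t, take argmax
        best = min(best, time.perf_counter() - t0)
    return best                # best-of-repeats, in seconds
```

Taking the best of a few repeats reduces the influence of transient system load on the measurement.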
The LAL run time depends on the parameters of the random forest regressor, which are estimated via cross-validation (discussed in the supplementary materials). Run times of a Python-based implementation running on 1 core are given in Tab. 1 for a typical parameter set (±20% depending on exact parameter values). Real-time performance can be attained by parallelising and optimising the code, even in applications with large amounts of high-dimensional data.

Table 1: Time in seconds for one iteration of AL for various strategies and tasks.

Dataset         Dimensions   # samples   Us     ALBE    LAL
Checkerboard    2            1000        0.11   13.12   0.54
MRI mini        188          2000        0.11   64.52   0.55
MRI             188          22 934      0.12   —       0.88
Striatum mini   272          2000        0.11   75.64   0.59
Striatum        272          276 130     2.05   —       19.50
Credit          30           142 404     0.43   —       4.73

6 Conclusion

In this paper we introduced a new approach to AL that is driven by data: Learning Active Learning. We found that Learning Active Learning from simple 2D data generalizes remarkably well to challenging new domains, and that learning from a subset of application-specific data further extends the applicability of our approach. Finally, LAL demonstrated robustness to the choice of classifier type and features.
In future work we would like to address multi-class classification and batch-mode AL. We would also like to experiment with training the LAL regressor to predict the change in various performance metrics, and with different families of classifiers. Another interesting direction is to transfer a LAL strategy between different real datasets, for example by training a regressor on multiple real datasets and evaluating its performance on unseen datasets.
Finally, we would like to go beyond constructing greedy strategies by using reinforcement learning.

Acknowledgements

This project has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1). We would like to thank Carlos Becker and Helge Rhodin for their comments on the text, and Lucas Maystre for his discussions and attention to detail.