{"title": "Discriminative Batch Mode Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 600, "abstract": "Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance at a time while retraining in each iteration. However, single instance selection systems are unable to exploit a parallelized labeler when one is available. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration, guided by some heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can apply a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning approach is more effective than current state-of-the-art batch mode active learning algorithms.", "full_text": "Discriminative Batch Mode Active Learning\n\nYuhong Guo and Dale Schuurmans\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\n{yuhong, dale}@cs.ualberta.ca\n\nAbstract\n\nActive learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. 
Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can apply a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning approach is more effective than current state-of-the-art batch mode active learning algorithms.\n\n1 Introduction\n\nLearning a good classifier requires a sufficient number of labeled training instances. In many circumstances, unlabeled instances are easy to obtain, while labeling is expensive or time consuming. For example, it is easy to download a large number of webpages; however, it typically requires manual effort to produce classification labels for these pages. Randomly selecting unlabeled instances for labeling is inefficient in many situations, since non-informative or redundant instances might be selected. Hence, active learning (i.e., selective sampling) methods have been adopted to control the labeling process in many areas of machine learning, with the goal of reducing the overall labeling effort.\n\nGiven a large pool of unlabeled instances, active learning provides a way to iteratively select the most informative unlabeled instances -- the queries -- to label. This is the typical setting of pool-based active learning. 
Most active learning approaches, however, have focused on selecting only one unlabeled instance at a time, while retraining the classifier on each iteration. When the training process is hard or time consuming, this repeated retraining is inefficient. Furthermore, if a parallel labeling system is available, a single instance selection system can make wasteful use of the resource. Thus, a batch mode active learning strategy that selects multiple instances each time is more appropriate under these circumstances. Note that simply using a single instance selection strategy to select more than one unlabeled instance in each iteration does not work well, since it fails to take the information overlap between the multiple instances into account. Principles for batch mode active learning need to be developed to address multi-instance selection specifically. In fact, a few batch mode active learning approaches have been proposed recently [2, 8, 9, 17, 19]. However, most extend existing single instance selection strategies into multi-instance selection simply by using a heuristic score or greedy procedure to ensure both instance diversity and informativeness.\n\nIn this paper, we propose a new discriminative batch mode active learning strategy that exploits information from an unlabeled set to attempt to learn a good classifier directly. We define a good classifier to be one that obtains high likelihood on the labeled training instances and low uncertainty on the labels of the unlabeled instances. We therefore formulate the instance selection problem as an optimization problem with respect to auxiliary instance selection variables, taking a combination of discriminative classification performance and label uncertainty as the objective function. 
Unfortunately, this optimization problem is NP-hard, thus seeking the optimal solution is intractable. However, we can approximate it locally using a second-order Taylor expansion and obtain a suboptimal solution using a quasi-Newton local optimization technique.\n\nThe instance selection variables we introduce can be interpreted as indicating self-supervised, optimistic guesses for the labels of the selected unlabeled instances. A concern about the instance selection process, therefore, is that some information in the unlabeled data that is inconsistent with the true classification partition might mislead instance selection. Fortunately, the active learning method can immediately tell whether it has been misled, by comparing the true labels with its optimized guesses. Therefore, one can then adjust the active selection strategy to avoid such over-fitting in the next iteration, whenever a mismatch between the labeled and unlabeled data has been detected. An empirical study on UCI datasets shows that the proposed batch mode active learning method is more effective than some current state-of-the-art batch mode active learning algorithms.\n\n2 Related Work\n\nMany researchers have addressed the active learning problem in a variety of ways. Most have focused on selecting a single most informative unlabeled instance to label at a time. Many such approaches therefore make myopic decisions based solely on the current learned classifier, and select the unlabeled instance for which there is the greatest uncertainty. [10] chooses the unlabeled instance with conditional probability closest to 0.5 as the most uncertain instance. [5] takes the instance on which a committee of classifiers disagree the most. [3, 18] suggest choosing the instance closest to the classification boundary, where [18] analyzes this active learning strategy as a version space reduction process. 
Approaches that exploit unlabeled data to provide complementary information for active learning have also been proposed. [4, 20] exploit unlabeled data by using the prior density p(x) as uncertainty weights. [16] selects the instance that optimizes the expected generalization error over the unlabeled data. [11] uses an EM approach to integrate information from unlabeled data. [13, 22] consider combining active learning with semi-supervised learning. [14] presents a mathematical model that explicitly combines clustering and active learning. [7] presents a discriminative approach that implicitly exploits the clustering information contained in the unlabeled data by considering optimistic labelings.\n\nSince single instance selection strategies require tedious retraining with each instance labeled (and, moreover, since they cannot take advantage of parallel labeling systems), many batch mode active learning methods have recently been proposed. [2, 17, 19] extend single instance selection strategies that use support vector machines. [2] takes the diversity of the selected instances into account, in addition to individual informativeness. [19] proposes a representative sampling approach that selects the cluster centers of the instances lying within the margin of a support vector machine. [8, 9] choose multiple instances that efficiently reduce the Fisher information. Overall, these approaches use a variety of heuristics to guide the instance selection process, where the selected batch should be informative about the classification model while being diverse enough so that their information overlap is minimized.\n\nInstead of using heuristic measures, in this paper we formulate batch mode active learning as an optimization problem that aims to learn a good classifier directly. 
Our optimization selects the best set of unlabeled instances and their labels to produce a classifier that attains maximum likelihood on the labels of the labeled instances while attaining minimum uncertainty on the labels of the unlabeled instances. It is intractable to conduct an exhaustive search for the optimal solution; our optimization problem is NP-hard. Nevertheless, we can exploit a second-order Taylor approximation and use a quasi-Newton optimization method to quickly reach a local solution. Our proposed approach provides an example of exploiting optimization techniques in batch mode active learning research, much like other areas of machine learning where optimization techniques have been widely applied [1].\n\n3 Logistic Regression\n\nIn this paper, we use binary logistic regression as the base classification algorithm. Logistic regression is a well-known and mature statistical model for probabilistic classification that has been actively studied and applied in machine learning. Given a test instance x, binary logistic regression models the conditional probability of the class label y in {+1, -1} by\n\np(y|x; w) = 1 / (1 + exp(-y w^T x))\n\nwhere w is the model parameter. Here the bias term is omitted for simplicity of notation. The model parameters can be trained by maximizing the likelihood of the labeled training data, i.e., minimizing the logloss of the training instances\n\nmin_w  sum_{i in L} log(1 + exp(-y_i w^T x_i)) + (lambda/2) w^T w    (1)\n\nwhere L indexes the training instances, and (lambda/2) w^T w is a regularization term introduced to avoid over-fitting problems. Logistic regression is a robust classifier that can be trained efficiently using various convex optimization techniques [12]. 
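The regularized logloss objective (1) above can be minimized with any gradient-based solver. A minimal numpy sketch follows; it is our own illustration (the helper names `sigmoid` and `train_logreg` are ours), using plain gradient descent rather than whichever convex solver the authors used:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lam=1.0, lr=0.1, iters=500):
    """Minimize sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) w^T w (Eq. 1)
    by plain gradient descent. Labels y are in {+1, -1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        margins = y * (X @ w)
        # d/dw of log(1 + exp(-m_i)) is -y_i * x_i * sigmoid(-m_i)
        grad = -(X * (y * sigmoid(-margins))[:, None]).sum(axis=0) + lam * w
        w -= lr * grad
    return w

# toy linearly separable data with labels in {+1, -1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_logreg(X, y)
p = sigmoid(y * (X @ w))   # P(true label | x) per training instance
```

Any convex optimizer (Newton, L-BFGS) would reach the same regularized optimum; gradient descent just keeps the sketch short.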
Although it is a linear classifier, it is easy to obtain nonlinear classifications by simply introducing kernels [21].\n\n4 Discriminative Batch Mode Active Learning\n\nFor active learning, one typically encounters a small number of labeled instances and a large number of unlabeled instances. Instance selection strategies based only on the labeled data therefore ignore potentially useful information embodied in the unlabeled instances. In this section, we present a new discriminative batch mode active learning algorithm for binary classification that exploits information in the unlabeled instances. The proposed approach is discriminative in the sense that (1) it selects a batch of instances by optimizing a discriminative classification model; and (2) it selects instances by considering the best discriminative configuration of their labels, leading to the best classifier. Unlike other batch mode active learning methods, which identify the most informative batch of instances using heuristic measures, our approach aims to identify the batch of instances that directly optimizes classification performance.\n\n4.1 Optimization Problem\n\nAn optimal active learning strategy selects a set of instances to label that leads to learning the best classifier. We assume the learner selects a set of a fixed size m, which is chosen as a parameter. Supervised learning methods typically maximize the likelihood of training instances. With unlabeled data being available, semi-supervised learning methods have been proposed that train by simultaneously maximizing the likelihood of labeled instances and minimizing the uncertainty of the labels for unlabeled instances [6]. 
That is, to achieve a classifier with better generalization performance, one can maximize the expected log likelihood of the labeled data and minimize the entropy of the missing labels on the unlabeled data, according to\n\nsum_{i in L} log P(y_i|x_i; w) + alpha sum_{j in U} sum_{y = +/-1} P(y|x_j; w) log P(y|x_j; w)    (2)\n\nwhere alpha is a tradeoff parameter used to adjust the relative influence of the labeled and unlabeled data, w specifies the conditional model, L indexes the labeled instances, and U indexes the unlabeled instances.\n\nThe new active learning approach we propose is motivated by this semi-supervised learning principle. We propose to select a batch of m unlabeled instances, S, to label in each iteration from the total unlabeled set U, with the goal of maximizing the objective (2). Specifically, we define the score function for a set of selected instances S in iteration t+1 as follows\n\nf(S) = sum_{i in L^t union S} log P(y_i|x_i; w^{t+1}) - alpha sum_{j in U^t - S} H(y|x_j; w^{t+1})    (3)\n\nwhere w^{t+1} is the parameter set for the conditional classification model trained on the new labeled set L^{t+1} = L^t union S, and H(y|x_j; w^{t+1}) denotes the entropy of the conditional distribution P(y|x_j; w^{t+1}), such that\n\nH(y|x_j; w^{t+1}) = - sum_{y = +/-1} P(y|x_j; w^{t+1}) log P(y|x_j; w^{t+1})\n\nThe proposed active learning strategy is to select the batch of instances that has the highest score. In practice, however, it is problematic to use the f(S) score directly to guide instance selection: the labels for the instances in S are not known when the selection is conducted. 
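The score in Eq. (3) is just labeled log-likelihood minus alpha times the total predictive entropy on the remaining pool. A minimal numpy sketch of evaluating it for a logistic model (the helper names `score_f` and `binary_entropy` are ours, and w is taken as already trained):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p):
    """H(y|x) in nats for P(y=+1|x)=p; clipping treats 0*log(0) as 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def score_f(w, X_lab, y_lab, X_unl, alpha=1.0):
    """Eq. (3)-style score: log-likelihood on the labeled set minus
    alpha * total entropy over the remaining unlabeled pool."""
    loglik = np.sum(np.log(sigmoid(y_lab * (X_lab @ w))))
    ent = np.sum(binary_entropy(sigmoid(X_unl @ w)))
    return loglik - alpha * ent

# toy model and data: two labeled points, two unlabeled points
w = np.array([1.0, -1.0])
X_lab = np.array([[2.0, 0.0], [0.0, 2.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[0.0, 0.0], [3.0, -3.0]])   # one uncertain, one confident
s = score_f(w, X_lab, y_lab, X_unl)
```

Larger alpha penalizes residual uncertainty on the pool more heavily, so the score drops as alpha grows.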
One typical solution for this problem is to use the expected f(S) score computed under the current conditional model specified by w^t\n\nE[f(S)] = sum_{y_S} P(y_S|x_S; w^t) f(S)\n\nHowever, using P(y_S|x_S; w^t) as weights, this expectation might aggravate any ambiguity that already exists in the current classification model w^t, since it has been trained on a very small labeled set L^t. Instead, we propose an optimistic strategy: use the best f(S) score that the batch of unlabeled instances S can achieve over all possible label configurations. This optimistic scoring function can be written as\n\nf(S) = max_{y_S}  sum_{i in L^t union S} log P(y_i|x_i; w^{t+1}) - alpha sum_{j in U^t - S} H(y|x_j; w^{t+1})    (4)\n\nThus the problem becomes how to select a set of instances S that achieves the best optimistic f(S) score defined in (4). Although this problem can be solved using an exhaustive search over all size-m subsets S of the unlabeled set U, it is intractable to do so in practice since the search space is exponentially large. Explicit heuristic search approaches seeking a local optimum do not exist either, since it is hard to define an efficient set of operators that can transfer from one position to another within the search space while guaranteeing improvements to the optimistic score.\n\nInstead, in this paper we propose to approach the problem by formulating optimistic batch mode active learning as an explicit mathematical optimization. Given the labeled set L^t and unlabeled set U^t after iteration t, the task in iteration t+1 is to select a size-m subset S from U^t that achieves the best score defined in (4). To do so, we first introduce a set of {0,1}-valued instance selection variables mu. In particular, mu is a |U^t| x 2 indicator matrix, where each row vector mu_j corresponds to the two possible labels {+1, -1} of the jth instance in U^t. 
Then the optimistic instance selection for iteration t+1 can be formulated as the following optimization problem\n\nmax_mu  sum_{i in L^t} log P(y_i|x_i; w^{t+1}) + beta sum_{j in U^t} v_j^{t+1} mu_j^T - alpha sum_{j in U^t} (1 - mu_j e) H(y|x_j; w^{t+1})    (5)\n\ns.t.  mu in {0,1}^{|U^t| x 2}    (6)\n      mu . E = m    (7)\n      mu_j e <= 1, for all j    (8)\n      1^T mu <= (1/2 + epsilon) m e^T    (9)\n\nwhere v_j^{t+1} is the row vector [log P(y=1|x_j; w^{t+1}), log P(y=-1|x_j; w^{t+1})]; e is a 2-entry column vector of all 1s; 1 is a |U^t|-entry column vector of all 1s; E is a |U^t| x 2 matrix of all 1s; the operator (.) denotes the matrix inner product; epsilon is a user-provided parameter that controls class balance during instance selection; and beta is a parameter that we will use later to adjust our belief in the guessed labels. Note that the selection variables mu not only choose instances from U^t, but also select labels for the selected instances. Solving this optimization yields the optimal mu for instance selection in iteration t+1.\n\nThe optimization problem (5) is an integer programming problem that produces results equivalent to using exhaustive search to optimize (4), except that we have the additional class balance constraint (9). Integer programming is an NP-hard problem. 
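The objective (5) above is easy to evaluate once mu is given. A minimal numpy sketch (our own helper `selection_objective`); note one simplification we make for brevity: the paper retrains w^{t+1} for every candidate mu, whereas here w is held fixed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selection_objective(mu, w, X_lab, y_lab, X_unl, alpha=1.0, beta=1.0):
    """Evaluate the Eq. (5) objective for a FIXED classifier w.
    mu is |U| x 2; column 0 guesses y=+1, column 1 guesses y=-1."""
    p = np.clip(sigmoid(X_unl @ w), 1e-12, 1 - 1e-12)   # P(y=+1 | x_j, w)
    v = np.column_stack([np.log(p), np.log(1 - p)])      # row vectors v_j
    H = -(p * np.log(p) + (1 - p) * np.log(1 - p))       # entropy H(y|x_j)
    loglik = np.sum(np.log(sigmoid(y_lab * (X_lab @ w))))
    picked = mu.sum(axis=1)                              # mu_j e
    return loglik + beta * np.sum(v * mu) - alpha * np.sum((1 - picked) * H)

# selecting a confident instance with the consistent label guess should
# score higher than selecting it with the opposite guess
w = np.array([1.0, 0.0])
X_lab = np.array([[1.0, 0.0]]); y_lab = np.array([1.0])
X_unl = np.array([[2.0, 0.0], [0.0, 1.0]])
mu_right = np.array([[1.0, 0.0], [0.0, 0.0]])   # pick x_0, guess y=+1
mu_wrong = np.array([[0.0, 1.0], [0.0, 0.0]])   # pick x_0, guess y=-1
obj_right = selection_objective(mu_right, w, X_lab, y_lab, X_unl)
obj_wrong = selection_objective(mu_wrong, w, X_lab, y_lab, X_unl)
```

The beta * v_j mu_j^T term is what rewards label guesses the current model finds plausible; the entropy term only counts instances left in the pool.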
Thus, the first step toward solving this problem in practice is to relax it into a continuous optimization by replacing the integer constraints (6) with continuous constraints 0 <= mu <= 1, yielding the relaxed formulation\n\nmax_mu  sum_{i in L^t} log P(y_i|x_i; w^{t+1}) + beta sum_{j in U^t} v_j^{t+1} mu_j^T - alpha sum_{j in U^t} (1 - mu_j e) H(y|x_j; w^{t+1})    (10)\n\ns.t.  0 <= mu <= 1    (11)\n      mu . E = m    (12)\n      mu_j e <= 1, for all j    (13)\n      1^T mu <= (1/2 + epsilon) m e^T    (14)\n\nIf we can solve this continuous optimization problem, a greedy strategy can then be used to recover an integer solution by iteratively setting the largest non-integer mu value to 1 subject to the constraints. However, this relaxed optimization problem is still very complex: the objective function (10) is not a concave function of mu.[1] Nevertheless, standard continuous optimization techniques can be used to solve for a local maximum.\n\n4.2 Quasi-Newton Method\n\nTo derive a local optimization technique, consider the objective function (10) as a function of the instance selection variables mu\n\nf(mu) = sum_{i in L^t} log P(y_i|x_i; w^{t+1}) + beta sum_{j in U^t} v_j^{t+1} mu_j^T - alpha sum_{j in U^t} (1 - mu_j e) H(y|x_j; w^{t+1})    (15)\n\nAs noted, this function is non-concave, therefore convenient convex optimization techniques that achieve globally optimal solutions cannot be applied. Nevertheless, a local optimization approach exploiting quasi-Newton methods can quickly determine a locally optimal solution mu*. Such a local optimization approach iteratively updates mu to improve the objective (15), and stops when a local maximum is reached. 
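The greedy recovery step described above (repeatedly fixing the largest fractional mu entry to 1) can be sketched as follows; this is our own illustration and, for brevity, it enforces only the one-label-per-instance and m-instances constraints, omitting the class-balance constraint (14):

```python
import numpy as np

def greedy_round(mu, m):
    """Round a relaxed selection matrix mu (|U| x 2, entries in [0,1]) to an
    integer solution: repeatedly set the largest remaining entry to 1,
    then retire that instance's row so each instance is picked at most once."""
    mu = np.asarray(mu, dtype=float).copy()
    out = np.zeros_like(mu)
    for _ in range(m):
        j, c = np.unravel_index(np.argmax(mu), mu.shape)  # largest entry
        out[j, c] = 1.0
        mu[j, :] = -np.inf                                # instance j used up
    return out

mu_relaxed = np.array([[0.7, 0.1],
                       [0.2, 0.6],
                       [0.3, 0.3]])
rounded = greedy_round(mu_relaxed, 2)   # picks instance 0 as +1, instance 1 as -1
```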
At each iteration, it makes a local move that allows it to achieve the largest improvement in the objective function along a direction determined by cumulative information obtained from the sequence of local gradients. Suppose mu-bar(k) is the starting point for iteration k. We first derive a second-order Taylor approximation f~(mu) for the objective function f(mu) at mu-bar(k)\n\nf~(mu) = f(mu-bar(k)) + grad f_k^T vec(mu - mu-bar(k)) + (1/2) vec(mu - mu-bar(k))^T H_k vec(mu - mu-bar(k))    (16)\n\nwhere vec(.) is a function that transforms a matrix into a column vector, and grad f_k = grad f(mu-bar(k)) and H_k denote the gradient vector and Hessian matrix of f(mu) at the point mu-bar(k), respectively. Since our original optimization function f(mu) is smooth, the quadratic function f~(mu) can reasonably approximate it in a small neighborhood of mu-bar(k). Thus we can determine our update direction by solving a quadratic program with the objective (16) and the linear constraints (11), (12), (13) and (14). Suppose the optimal solution for this quadratic program is mu~(k). Then a reasonable update direction d_k = mu~(k) - mu-bar(k) can be obtained for iteration k. Given this direction, a backtracking line search can be used to guarantee improvement over the original objective (15). Note that for each different value of mu, w^{t+1} has to be retrained on L^t union S to evaluate the new objective value, since S is determined by mu. In order to reduce the computational cost, we approximate the training of w^{t+1} in our empirical study by limiting it to a few Newton steps with a starting point given by w^t trained only on L^t.\n\nThe remaining issue is to compute the local gradient grad f(mu-bar(k)) and the Hessian matrix H_k. We assume w^{t+1} remains constant under small local updates to mu-bar. 
Thus the local gradient can be approximated as\n\ngrad f(mu-bar_j(k)) = beta v_j^{t+1} + alpha [H(y|x_j; w^{t+1}), H(y|x_j; w^{t+1})]\n\nand therefore grad f(mu-bar(k)) can be constructed from the individual gradients grad f(mu-bar_j(k)). We then use BFGS (Broyden-Fletcher-Goldfarb-Shanno) to compute the Hessian matrix, which starts as an identity matrix for the first iteration, and is updated in each iteration as follows [15]\n\nH_{k+1} = H_k - (H_k s_k s_k^T H_k) / (s_k^T H_k s_k) + (y_k y_k^T) / (y_k^T s_k)\n\nwhere y_k = grad f_{k+1} - grad f_k, and s_k = mu-bar(k+1) - mu-bar(k). This Hessian matrix accumulates information from the sequence of local gradients to help determine better update directions.\n\n[1] Note that w^{t+1} is the classification model parameter set trained on L^{t+1} = L^t union S, where S indexes the unlabeled instances selected by mu. Therefore w^{t+1} is a function of mu.\n\n4.3 Adjustment Strategy\n\nIn the discriminative optimization problem formulated in Section 4.1, the mu variables are used to optimistically select both instances and their labels, with the goal of achieving the best classification model according to the objective (5). However, when the labeled set is small and the discriminative partition (clustering) information contained in the large unlabeled set is inconsistent with the true classification, the labels optimistically guessed for the selected instances through mu might not match the underlying true labels. When this occurs, the instances selected will not be very useful for identifying the true classification model. Furthermore, the unlabeled data might continue to mislead the next instance selection iteration.\n\nFortunately, we can immediately identify when the process has been misled once the true labels for the selected instances have been obtained. 
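The BFGS update from Section 4.2 above is a two-line formula in numpy. A minimal sketch (the helper name `bfgs_update` is ours), starting from the identity as the paper does; the resulting matrix satisfies the secant condition H_{k+1} s_k = y_k and stays symmetric:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the Hessian approximation:
    H_{k+1} = H_k - (H s s^T H)/(s^T H s) + (y y^T)/(y^T s)."""
    Hs = H @ s
    return H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (y @ s)

H0 = np.eye(2)                   # first iteration starts from the identity
s = np.array([1.0, 0.0])         # step: mu-bar(k+1) - mu-bar(k)
y = np.array([2.0, 1.0])         # gradient difference: grad f_{k+1} - grad f_k
H1 = bfgs_update(H0, s, y)       # satisfies H1 @ s == y
```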
If the true labels are different from the labels guessed by the optimization, we need to make an adjustment for the next instance selection iteration. We have tried a few adjustment strategies in our study, but report only the most effective one in this paper. Note that the problem of being misled is caused by the unlabeled data, which affects the target classification model through the term beta sum_{j in U^t} v_j^{t+1} mu_j^T. Therefore, a simple way to fix the problem is to adjust the parameter beta. Specifically, at the end of each iteration t, we obtain the true labels y_S for the selected instances S, and compare them with our guessed labels y^_S indicated by mu*. If they are consistent, we set beta = 1, which means we trust the partition information from the unlabeled data as much as the label information in the labeled data for building the classification model. If y_S != y^_S, we should clearly reduce the beta value, that is, reduce the influence of the unlabeled data on the next selection iteration t+1. We use a simple heuristic procedure to determine the beta value in this case. Starting from beta = 1, we multiplicatively reduce its value by a small factor, 0.5, until a better objective value for (15) is obtained when replacing the guessed indicator variables mu* with the true label indicators. 
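The halving heuristic just described can be sketched in a few lines. This is our own illustration: `adjust_beta` and the toy objective are hypothetical names, and the objective is abstracted to any callable `objective(mu, beta)` standing in for Eq. (15):

```python
def adjust_beta(objective, mu_guess, mu_true, shrink=0.5, min_beta=1e-6):
    """Section 4.3 heuristic: starting from beta = 1, multiplicatively shrink
    beta until the objective at the TRUE label indicators is at least as good
    as at the optimistically guessed ones."""
    beta = 1.0
    while beta > min_beta and objective(mu_true, beta) < objective(mu_guess, beta):
        beta *= shrink
    return beta

# toy stand-in for Eq. (15): value = base + beta * bonus, where the wrong
# guess enjoys a larger unlabeled-data bonus until beta gets small
obj = lambda mu, beta: mu["base"] + beta * mu["bonus"]
mu_guess = {"base": 0.0, "bonus": 1.0}   # optimistic guess, mismatched labels
mu_true = {"base": 0.4, "bonus": 0.0}    # true label indicators
beta = adjust_beta(obj, mu_guess, mu_true)
```

Shrinking beta downweights the unlabeled-data term, which is exactly the lever the text identifies; at beta = 0 the method reduces to most-uncertain selection, as noted next.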
Note that if we reduce beta to zero, our optimization problem becomes exactly equivalent to picking the most uncertain instance (when m = 1).\n\n5 Experiments\n\nTo investigate the empirical performance of the proposed discriminative batch mode active learning algorithm (Discriminative), we conducted a set of experiments on nine two-class UCI datasets, comparing with a baseline random instance selection algorithm (Random), a non-batch myopic active learning method that selects the most uncertain instance each time (MostUncertain), and two batch mode active learning methods proposed in the literature: svmD, an approach that incorporates diversity in active learning with SVMs [2]; and Fisher, an approach that uses the Fisher information matrix for instance selection [9]. The UCI datasets we used include (we show the name, followed by the number of instances and the number of attributes): Australian (690; 14), Cleve (303; 13), Corral (128; 6), Crx (690; 15), Flare (1066; 10), Glass2 (163; 9), Heart (270; 13), Hepatitis (155; 20) and Vote (435; 15).\n\nWe consider a hard case of active learning, where only a few labeled instances are given at the start. In each experiment, we start with four randomly selected labeled instances, two in each class. We then randomly select 2/3 of the remaining instances as the unlabeled set, using the remaining instances for testing. All the algorithms start with the same initial labeled set, unlabeled set and testing set. For a fixed batch size m, each algorithm repeatedly selects m instances to label each time. In this section, we report the experimental results with m = 5, averaged over 20 repetitions.\n\nFigure 1 shows the comparison results on the nine UCI datasets. These results suggest that although the baseline random sampling method, Random, works surprisingly well in our experiments, the proposed algorithm, Discriminative, always performs better or at least achieves comparable performance. 
Moreover, Discriminative also clearly outperforms the other two batch mode algorithms, svmD and Fisher, on five datasets -- Australian, Cleve, Flare, Heart and Hepatitis -- and reaches a tie on two datasets -- Crx and Vote. The myopic most uncertain selection method, MostUncertain, shows overall inferior performance to Discriminative on Australian, Cleve, Crx, Heart and Hepatitis, and achieves a tie on Flare and Vote. However, Discriminative demonstrates weak performance on two datasets -- Corral and Glass2 -- where the evaluation lines for most algorithms in the figures are strangely bumpy. The reason behind this remains to be investigated.\n\n[Figure 1: Results on UCI Datasets. Nine panels (australian, cleve, corral, crx, flare, glass2, heart, hepatitis, vote), each plotting test accuracy against the number of labeled instances for Random, MostUncertain, svmD, Fisher and Discriminative.]\n\nThese empirical results suggest that selecting unlabeled instances by optimizing the classification model directly obtains more relevant and informative instances than using heuristic scores to guide the selection. Although the original optimization problem formulated is NP-hard, a relaxed local optimization method that leads to a locally optimal solution still works effectively.\n\n6 Conclusion\n\nIn this paper, we proposed a discriminative batch mode active learning approach that exploits information in unlabeled data and selects a batch of instances by optimizing the target classification model. Although the proposed technique could be overly optimistic about the information presented by the unlabeled set, and consequently be misled, this problem can be identified immediately after obtaining the true labels. A simple adjustment strategy can then be used to rectify the problem in the following iteration. Experimental results on UCI datasets show that this approach is generally more effective than other batch mode active learning methods, a random sampling method, and a myopic non-batch mode active learning method. 
Our current work is focused on two-class classification problems; however, it can easily be extended to multiclass classification problems.\n\nReferences\n\n[1] K. Bennett and E. Parrado-Hernandez. The interplay of optimization and machine learning research. Journal of Machine Learning Research, 7, 2006.\n\n[2] K. Brinker. Incorporating diversity in active learning with support vector machines. In Proceedings of the 20th International Conference on Machine Learning, 2003.\n\n[3] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proceedings of the 17th International Conference on Machine Learning, 2000.\n\n[4] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 1996.\n\n[5] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28, 1997.\n\n[6] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, 2005.\n\n[7] Y. Guo and R. Greiner. Optimistic active learning using mutual information. In Proceedings of the International Joint Conference on Artificial Intelligence, 2007.\n\n[8] S. Hoi, R. Jin, and M. Lyu. Large-scale text categorization by batch mode active learning. In Proceedings of the International World Wide Web Conference, 2006.\n\n[9] S. Hoi, R. Jin, J. Zhu, and M. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning, 2006.\n\n[10] D. Lewis and W. Gale. A sequential algorithm for training text classifiers. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval, 1994.\n\n[11] A. McCallum and K. Nigam. 
Employing EM in pool-based active learning for text classification. In Proceedings of the 15th International Conference on Machine Learning, 1998.\n\n[12] T. Minka. A comparison of numerical optimizers for logistic regression. Technical report, 2003. http://research.microsoft.com/~minka/papers/logreg/.\n\n[13] I. Muslea, S. Minton, and C. Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.\n\n[14] H. Nguyen and A. Smeulders. Active learning using pre-clustering. In Proceedings of the 21st International Conference on Machine Learning, 2004.\n\n[15] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 1999.\n\n[16] N. Roy and A. McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, 2001.\n\n[17] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proceedings of the 17th International Conference on Machine Learning, 2000.\n\n[18] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proceedings of the 17th International Conference on Machine Learning, 2000.\n\n[19] Z. Xu, K. Yu, V. Tresp, X. Xu, and J. Wang. Representative sampling for text classification using support vector machines. In Proceedings of the 25th European Conference on Information Retrieval Research, 2003.\n\n[20] C. Zhang and T. Chen. An active learning framework for content-based information retrieval. IEEE Transactions on Multimedia, 4:260-268, 2002.\n\n[21] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 2005.\n\n[22] X. Zhu, J. Lafferty, and Z. Ghahramani. 
Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003.", "award": [], "sourceid": 922, "authors": [{"given_name": "Yuhong", "family_name": "Guo", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": null}]}