{"title": "Multi-Level Active Prediction of Useful Image Annotations for Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1705, "page_last": 1712, "abstract": "We introduce a framework for actively learning visual categories from a mixture of weakly and strongly labeled image examples. We propose to allow the category-learner to strategically choose what annotations it receives---based on both the expected reduction in uncertainty as well as the relative costs of obtaining each annotation. We construct a multiple-instance discriminative classifier based on the initial training data. Then all remaining unlabeled and weakly labeled examples are surveyed to actively determine which annotation ought to be requested next. After each request, the current classifier is incrementally updated. Unlike previous work, our approach accounts for the fact that the optimal use of manual annotation may call for a combination of labels at multiple levels of granularity (e.g., a full segmentation on some images and a present/absent flag on others). As a result, it is possible to learn more accurate category models with a lower total expenditure of manual annotation effort.", "full_text": "Multi-Level Active Prediction of Useful Image\n\nAnnotations for Recognition\n\nSudheendra Vijayanarasimhan and Kristen Grauman\n\nDepartment of Computer Sciences\n\nUniversity of Texas at Austin\n\n{svnaras,grauman}@cs.utexas.edu\n\nAbstract\n\nWe introduce a framework for actively learning visual categories from a mixture of\nweakly and strongly labeled image examples. We propose to allow the category-\nlearner to strategically choose what annotations it receives\u2014based on both the\nexpected reduction in uncertainty as well as the relative costs of obtaining each\nannotation. We construct a multiple-instance discriminative classi\ufb01er based on the\ninitial training data. 
Then all remaining unlabeled and weakly labeled examples\nare surveyed to actively determine which annotation ought to be requested next.\nAfter each request, the current classi\ufb01er is incrementally updated. Unlike previous\nwork, our approach accounts for the fact that the optimal use of manual annotation\nmay call for a combination of labels at multiple levels of granularity (e.g., a full\nsegmentation on some images and a present/absent \ufb02ag on others). As a result, it\nis possible to learn more accurate category models with a lower total expenditure\nof manual annotation effort.\n\n1 Introduction\n\nVisual category recognition is a vital thread in computer vision research. The recognition problem\nremains challenging because of the wide variation in appearance a single class typically exhibits, as\nwell as differences in viewpoint, illumination, and clutter. Methods are usually most reliable when\ngood training sets are available, i.e., when labeled image examples are provided for each class, and\nwhere those training examples are adequately representative of the distribution to be encountered at\ntest time. The extent of an image labeling can range from a \ufb02ag telling whether the object of interest\nis present or absent, to a full segmentation specifying the object boundary. In practice, accuracy\noften improves with larger quantities of training examples and/or more elaborate annotations.\n\nUnfortunately, substantial human effort is required to gather such training sets, making it unclear\nhow the traditional protocol for visual category learning can truly scale. Recent work has begun to\nexplore ways to mitigate the burden of supervision [1\u20138]. 
While the results are encouraging, existing\ntechniques fail to address two key insights about low-supervision recognition: 1) the division\nof labor between the machine learner and the human labelers ought to respect any cues regarding\nwhich annotations would be easy (or hard) for either party to provide, and 2) to use a \ufb01xed amount\nof manual effort most effectively may call for a combination of annotations at multiple levels (e.g.,\na full segmentation on some images and a present/absent \ufb02ag on others). Humans ought to be responsible\nfor answering the hardest questions, while pattern recognition techniques ought to absorb\nand propagate that information and answer the easier ones. Meanwhile, the learning algorithm must\nbe able to accommodate the multiple levels of granularity that may occur in provided image annotations,\nand to compute which item at which of those levels appears to be most fruitful to have labeled\nnext (see Figure 1).\n\n[Figure 1 image omitted. Side labels: coarser labels, less expensive; \ufb01ner labels, more expensive.]\n\nFig. 1. Useful image annotations can occur at multiple levels of granularity. Left: For example, a learner may\nonly know whether the image contains a particular object or not (top row, dotted boxes denote object is present),\nor it may also have segmented foregrounds (middle row), or it may have detailed outlines of object parts (bottom\nrow). Right: In another scenario, groups of images for a given class are collected with keyword-based Web\nsearch. 
The learner may only be given the noisy groups and told that each includes at least one instance of the\nspeci\ufb01ed class (top), or, for some groups, the individual example images may be labeled as positive or negative\n(bottom). We propose an active learning paradigm that directs manual annotation effort to the most informative\nexamples and levels.\n\nTo address this challenge, we propose a method that actively targets the learner\u2019s requests for su-\npervision so as to maximize the expected bene\ufb01t to the category models. Our method constructs an\ninitial classi\ufb01er from limited labeled data, and then considers all remaining unlabeled and weakly\nlabeled examples to determine what annotation seems most informative to obtain. Since the varying\nlevels of annotation demand varying degrees of manual effort, our active selection process weighs\nthe value of the information gain against the cost of actually obtaining any given annotation. After\neach request, the current classi\ufb01er is incrementally updated, and the process repeats.\n\nOur approach accounts for the fact that image annotations can exist at multiple levels of granularity:\nboth the classi\ufb01er and active selection objectives are formulated to accommodate dual-layer labels.\nTo achieve this duality for the classi\ufb01er, we express the problem in the multiple instance learning\n(MIL) setting [9], where training examples are speci\ufb01ed as bags of the \ufb01ner granularity instances,\nand positive bags may contain an arbitrary number of negatives. To achieve the duality for the active\nselection, we design a decision-theoretic criterion that balances the variable costs associated with\neach type of annotation with the expected gain in information. 
Essentially this allows the learner to\nautomatically predict when the extra effort of a more precise annotation is warranted.\n\nThe main contribution of this work is a uni\ufb01ed framework to actively learn categories from a mixture\nof weakly and strongly labeled examples. We are the \ufb01rst to identify and address the problem of\nactive visual category learning with multi-level annotations. In our experiments we demonstrate\ntwo applications of the framework for visual learning (as highlighted in Figure 1). Not only does our\nactive strategy learn more quickly than a random selection baseline, but for a \ufb01xed amount of manual\nresources, it yields more accurate models than conventional single-layer active selection strategies.\n\n2 Related Work\n\nThe recognition community is well-aware of the expense of requiring well-annotated image datasets.\nRecent methods have shown the possibility of learning visual patterns from unlabeled [3, 2] image\ncollections, while other techniques aim to share or re-use knowledge across categories [10, 4]. Sev-\neral authors have successfully leveraged the free but noisy images on the Web [5, 6, 11]. Using\nweakly labeled images to learn categories was proposed in [1], and several researchers have shown\nthat MIL can accommodate the weak or noisy supervision often available for image data [11\u201314].\nWorking in the other direction, some research seeks to facilitate the manual labor of image annota-\ntion, tempting users with games or nice datasets [7, 8].\n\nHowever, when faced with a distribution of unlabeled images, almost all existing methods for vi-\nsual category learning are essentially passive, selecting points at random to label. Active learning\nstrategies introduced in the machine learning literature generally select points so as to minimize the\nmodel entropy or reduce classi\ufb01cation error (e.g., [15, 16]). 
Decision-theoretic measures for traditional\n(single-instance) learning have been explored in [17, 18], where they were applied to classify\nsynthetic data and voicemail. Our active selection procedure is in part inspired by this work, as it\nalso seeks to balance the cost and utility tradeoff. Recent work has considered active learning with\nGaussian Process classi\ufb01ers [19], and relevance feedback for video annotations [20].\n\nIn contrast, we show how to form active multiple-instance learners, where constraints or labels must\nbe sought at multiple levels of granularity. Further, we introduce the notion of predicting when to\n\u201cinvest\u201d the labor of more expensive image annotations so as to ultimately yield bigger bene\ufb01ts to\nthe classi\ufb01er. Unlike any previous work, our method continually guides the annotation process to\nthe appropriate level of supervision. While an active criterion for instance-level queries is suggested\nin [21] and applied within an MI learner, it cannot actively select positive bags or unlabeled bags,\nand does not consider the cost of obtaining the labels requested. In contrast, we formulate a general\nselection function that handles the full MIL paradigm and adapts according to the label costs.\nExperiments show this functionality to be critical for ef\ufb01cient learning from few images.\n\n3 Approach\n\nThe goal of this work is to learn to recognize an object or category with minimal human intervention.\nThe key idea is to actively determine which annotations a user should be asked to provide, and in\nwhat order. We consider image collections consisting of a variety of supervisory information: some\nimages are labeled as containing the category of interest (or not), some have both a class label\nand a foreground segmentation, while others have no annotations at all. 
We derive an active learning\ncriterion function that predicts how informative further annotation on any particular unlabeled image\nor region would be, while accounting for the variable expense associated with different annotation\ntypes. As long as the information expected from further annotations outweighs the cost of obtaining\nthem, our algorithm will request the next valuable label, re-train the classi\ufb01er, and repeat.\n\nIn the following we outline the MIL paradigm and discuss its applicability for two important image\nclassi\ufb01cation scenarios. Then, we describe our decision-theoretic approach to actively request useful\nannotations. Finally, we discuss how to attribute costs and risks for multi-level annotations.\n\n3.1 Multiple-Instance Visual Category Learning\n\nTraditional binary supervised classi\ufb01cation assumes the learner is provided a collection of labeled\ndata patterns, and must learn a function to predict labels on new instances. However, the fact that\nimage annotations can exist at multiple levels of granularity demands a learning algorithm that can\nencode any known labels at the levels they occur, and so MIL [9] is more applicable. In MIL, the\nlearner is instead provided with sets (bags) of patterns rather than individual patterns, and is only told\nthat at least one member of any positive bag is truly positive, while every member of any negative\nbag is guaranteed to be negative. The goal of MIL is to induce the function that will accurately label\nindividual instances such as the ones within the training bags.\n\nMIL is well-suited for the following two image classi\ufb01cation scenarios:\n\n\u2022 Training images are labeled as to whether they contain the category of interest, but they also contain other\nobjects and background clutter. Every image is represented by a bag of regions, each of which is charac-\nterized by its color, texture, shape, etc. [12, 13]. 
For positive bags, at least one of the regions contains the\nobject of interest. The goal is to predict when new image regions contain the object, that is, to learn to\nlabel regions as foreground or background.\n\n\u2022 The keyword associated with a category is used to download groups of images from multiple search engines\nin multiple languages. Each downloaded group is a bag, and the images within it are instances [11]. For\neach positive bag, at least one image actually contains the object of interest, while many others may be\nirrelevant. The goal is to predict the presence or absence of the category in new images.\n\nIn both cases, an instance-level decision is desirable, but bag-level labels are easier to obtain. While\nit has been established that MIL is valuable in such cases, previous methods do not consider how to\ndetermine what labels would be most bene\ufb01cial to obtain.\n\nWe integrate our active selection method with the SVM-based MIL approach given in [22], which\nuses a Normalized Set Kernel (NSK) to describe bags based on the average representation of instances\nwithin them. Following [23], we use the NSK mapping for positive bags only; all instances\nin a negative bag are treated individually as negative. We chose this classi\ufb01er since it performs\nwell in practice [24] and allows incremental updates [25]; further, by virtue of being a kernel-based\nalgorithm, it gives us \ufb02exibility in our choices of features and kernels. However, alternative MIL\ntechniques that provide probabilistic outputs could easily be swapped in (e.g. [26, 24, 23]).\n\n3.2 Multi-Level Active Selection of Image Annotations\n\nGiven the current MIL classi\ufb01er, our objective is to select what annotation should be requested next.\nWhereas active selection criteria for traditional supervised classi\ufb01ers need only identify the best\ninstance to label next, in the MIL domain we have a more complex choice. 
There are three possible\ntypes of request: the system can ask for a label on an instance, a label on an unlabeled bag, or for\na joint labeling of all instances within a positive bag. So, we must design a selection criterion that\nsimultaneously determines which type of annotation to request, and for which example to request\nit. Adding to the challenge, the selection process must also account for the variable costs associated\nwith each level of annotation (e.g., it will take the annotator less time to detect whether the class of\ninterest is present or not, while a full segmentation will be more expensive).\n\nWe extend the value of information (VOI) strategy proposed in [18] to enable active MIL selection,\nand derive a generalized value function that can accept both instances and bags. This allows us to\npredict the information gain in a joint labeling of multiple instances at once, and thereby actively\nchoose when it is worthwhile to expend more or less manual effort in the training process. Our\nmethod continually re-evaluates the expected signi\ufb01cance of knowing more about any unlabeled or\npartially labeled example, as quanti\ufb01ed by the predicted reduction in misclassi\ufb01cation risk, net of the\ncost of obtaining the label.\n\nWe consider a collection of unlabeled data $X_U$, and labeled data $X_L$ composed of a set of positive\nbags $X_p$ and a set of negative instances $\\tilde{X}_n$. Recall that positively labeled bags contain instances\nwhose labels are unknown, since they contain an unknown mix of positive and negative instances.\nLet $r_p$ denote the user-speci\ufb01ed risk associated with misclassifying a positive example as negative,\nand $r_n$ denote the risk of misclassifying a negative example as positive. The risk associated with the labeled data is:\n\n$$\\mathrm{Risk}(X_L) = \\sum_{X_i \\in X_p} r_p \\, (1 - p(X_i)) + \\sum_{x_i \\in \\tilde{X}_n} r_n \\, p(x_i), \\tag{1}$$\n\nwhere $x_i$ denotes an instance and $X_i$ denotes a bag. 
Here $p(x)$ denotes the probability that a given\ninput is classi\ufb01ed as positive: $p(x) = \\Pr(\\mathrm{sgn}(w\\phi(x) + b) = +1 \\mid x)$ for the SVM hyperplane parameters\n$w$ and $b$. We compute these values using the mapping suggested in [27], which essentially\n\ufb01ts a sigmoid to map the SVM outputs to posterior probabilities. Note that here a positive bag $X_i$ is\n\ufb01rst transformed according to the NSK before computing its probability. The corresponding risk for\nunlabeled data is:\n\n$$\\mathrm{Risk}(X_U) = \\sum_{x_i \\in X_U} \\Big[ r_p \\, (1 - p(x_i)) \\Pr(y_i = +1 \\mid x_i) + r_n \\, p(x_i) \\, (1 - \\Pr(y_i = +1 \\mid x_i)) \\Big], \\tag{2}$$\n\nwhere $y_i$ is the true label for unlabeled example $x_i$. The value of $\\Pr(y = +1 \\mid x)$ is not directly\ncomputable for unlabeled data; following [18], we approximate it as $\\Pr(y = +1 \\mid x) \\approx p(x)$. This\nsimpli\ufb01es the risk for the unlabeled data to $\\mathrm{Risk}(X_U) = \\sum_{x_i \\in X_U} (r_p + r_n)(1 - p(x_i)) \\, p(x_i)$, where\nagain we transform unlabeled bags according to the NSK before computing the posterior.\n\nThe total cost $T(X_L, X_U)$ associated with the data is the total misclassi\ufb01cation risk, plus the cost of\nobtaining all labeled data thus far:\n\n$$T(X_L, X_U) = \\mathrm{Risk}(X_L) + \\mathrm{Risk}(X_U) + \\sum_{X_i \\in X_p} C(X_i) + \\sum_{x_i \\in \\tilde{X}_n} C(x_i), \\tag{3}$$\n\nwhere the function $C(\\cdot)$ returns the cost of obtaining an annotation for its input, and will be de\ufb01ned\nin more detail below.\n\nTo measure the expected utility of obtaining any particular new annotation, we want to predict\nthe change in total cost that would result from its addition to $X_L$. Thus, the value of obtaining an\nannotation for input $z$ is:\n\n$$\\mathrm{VOI}(z) = T(X_L, X_U) - T\\big(X_L \\cup z^{(t)}, \\; X_U \\setminus z\\big) \\tag{4}$$\n\n$$= \\mathrm{Risk}(X_L) + \\mathrm{Risk}(X_U) - \\Big( \\mathrm{Risk}\\big(X_L \\cup z^{(t)}\\big) + \\mathrm{Risk}(X_U \\setminus z) \\Big) - C(z),$$\n\nwhere $z^{(t)}$ denotes that the input $z$ has been merged into the labeled set with its true label $t$, and\n$X_U \\setminus z$ denotes that it has been removed from the set of unlabeled data. 
If the VOI is high for a\ngiven input, then the total cost would be decreased by adding its annotation; similarly, low values\nindicate minor gains, and negative values indicate an annotation that costs more to obtain than it is\nworth. Thus at each iteration, the active learner surveys all remaining unlabeled and weakly labeled\nexamples, computes their VOI, and requests the label for the example with the maximal value.\n\nHowever, there are two important remaining technical issues. First, for this to be useful we must\nbe able to estimate the empirical risk for inputs before their labels are known. Secondly, for active\nselection to proceed at multiple levels, the VOI must act as an overloaded function: we need to be\nable to evaluate the VOI when $z$ is an unlabeled instance or an unlabeled bag or a weakly labeled\nexample, i.e., a positive bag containing an unknown number of negative instances.\n\nTo estimate the total risk induced by incorporating a newly annotated example $z$ into $X_L$ before\nactually obtaining its true label $t$, we estimate the updated risk term with its expected value:\n$\\mathrm{Risk}(X_L \\cup z^{(t)}) + \\mathrm{Risk}(X_U \\setminus z) \\approx \\mathbb{E}[\\mathrm{Risk}(X_L \\cup z^{(t)}) + \\mathrm{Risk}(X_U \\setminus z)] = E$, where $E$ is shorthand\nfor the expected value expression preceding it. If $z$ is an unlabeled instance, then computing\nthe expectation is straightforward:\n\n$$E = \\sum_{l \\in L} \\Big( \\mathrm{Risk}\\big(X_L \\cup z^{(l)}\\big) + \\mathrm{Risk}(X_U \\setminus z) \\Big) \\Pr(\\mathrm{sgn}(w\\phi(z) + b) = l \\mid z), \\tag{5}$$\n\nwhere $L = \\{+1, -1\\}$ is the set of all possible label assignments for $z$. The value $\\Pr(\\mathrm{sgn}(w\\phi(z) + b) = l \\mid z)$\nis obtained by evaluating the current classi\ufb01er on $z$ and mapping the output to the associated\nposterior, and risk is computed based on the (temporarily) modi\ufb01ed classi\ufb01er with $z^{(l)}$ inserted\ninto the labeled set. 
Similarly, if $z$ is an unlabeled bag, the label assignment can only be positive or\nnegative, and we compute the probability of either label via the NSK mapping.\n\nIf $z$ is a positive bag containing $M = |z|$ instances, however, there are $2^M$ possible labelings:\n$L = \\{+1, -1\\}^M$. For even moderately sized bags, this makes a direct computation of the expectation\nimpractical. Instead, we use Gibbs sampling to draw samples of the label assignment from the joint\ndistribution over the $M$ instances\u2019 descriptors. Let $z = \\{z_1, \\ldots, z_M\\}$ be the positive bag\u2019s instances,\nand let $z^{(a)} = \\{(z_1^{(a_1)}), \\ldots, (z_M^{(a_M)})\\}$ denote the label assignment we wish to sample, with\n$a_j \\in \\{+1, -1\\}$. To sample from the conditional distribution of one instance\u2019s label given the rest (the\nbasic procedure required by Gibbs sampling), we re-train the MIL classi\ufb01er with the given labels\nadded, and then draw the remaining label according to $a_j \\sim \\Pr(\\mathrm{sgn}(w\\phi(z_j) + b) = +1 \\mid z_j)$, where\n$z_j$ denotes the one instance currently under consideration. For positive bag $z$, the expected total risk\nis then the average risk computed over all $S$ generated samples:\n\n$$E = \\frac{1}{S} \\sum_{k=1}^{S} \\Big( \\mathrm{Risk}\\big(\\{X_L \\setminus z\\} \\cup \\{z_1^{(a_1)_k}, \\ldots, z_M^{(a_M)_k}\\}\\big) + \\mathrm{Risk}\\big(X_U \\setminus \\{z_1, z_2, \\ldots, z_M\\}\\big) \\Big), \\tag{6}$$\n\nwhere $k$ indexes the $S$ samples. To compute the risk on $X_L$ for each \ufb01xed sample we simply remove\nthe weakly labeled positive bag $z$, and insert its instances as labeled positives and negatives,\nas dictated by the sample\u2019s label assignment. 
Computing the VOI values for all unlabeled data, espe-\ncially for the positive bags, requires repeatedly solving the classi\ufb01er objective function with slightly\ndifferent inputs; to make this manageable we employ incremental SVM updates [25].\n\nTo complete our active selection function, we must de\ufb01ne the cost function C(z), which maps an\ninput to the amount of effort required to annotate it. This function is problem-dependent. In the\nvisual categorization scenarios we have set forth, we de\ufb01ne the cost function in terms of the type of\nannotation required for the input z; we charge equal cost to label an instance or an unlabeled bag,\nand proportionally greater cost to label all instances in a positive bag, as determined empirically\nwith labeling experiments with human users. This re\ufb02ects that outlining an object contour is more\nexpensive than naming an object, or sorting through an entire page of Web search returns is more\nwork than labeling just one.\n\nWe can now actively select which examples and what type of annotation to request, so as to maxi-\nmize the expected bene\ufb01t to the category model relative to the manual effort expended. After each\nannotation is added and the classi\ufb01er is revised accordingly, the VOI is evaluated on the remaining\nunlabeled and weakly labeled data in order to choose the next annotation. This process repeats ei-\nther until the available amount of manual resources is exhausted, or, alternatively, until the maximum\nVOI is negative, indicating further annotations are not worth the effort.\n\n\f4 Results\n\nIn this section we demonstrate our approach to actively learn visual categories. We test with two\ndistinct publicly available datasets that illustrate the two learning scenarios above: (1) the SIVAL\ndataset1 of 25 objects in cluttered backgrounds, and (2) a Google dataset ([5]) of seven categories\ndownloaded from the Web. 
In both, the classi\ufb01cation task is to say whether each unseen image\ncontains the object of interest or not. We provide comparisons with single-level active learning (with\nboth the method of [21], and where the same VOI function is used but is restricted to actively label\nonly instances), as well as passive learning. For the passive baseline, we consider random selections\nfrom amongst both single-level and multi-level annotations, in order to verify that our approach does\nnot simply bene\ufb01t from having access to more informative possible labels. 2\n\nTo determine how much more labeling a positive bag costs relative to labeling an instance, we\nperformed user studies for both of the scenarios evaluated. For the \ufb01rst scenario, users were shown\noversegmented images and had to click on all the segments belonging to the object of interest. In the\nsecond, users were shown a page of downloaded Web images and had to click on only those images\ncontaining the object of interest. For both datasets, their baseline task was to provide a present/absent\n\ufb02ag on the images. For segmentation, obtaining labels on all positive segments took users on average\nfour times as much time as setting a \ufb02ag. For the Web images, it took 6.3 times as long to identify\nall positives within bags of 25 noisy images. Thus we set the cost of labeling a positive bag to 4 and\n6.3 for the SIVAL and Google data, respectively. These values agree with the average sparsity of the\ntwo datasets: the Google set contains about 30% true positive images while the SIVAL set contains\n10% positive segments per image. The users who took part in the experiment were untrained but still\nproduced consistent results.\n\n4.1 Actively Learning Visual Objects and their Foreground Regions from Cluttered Images\nThe SIVAL dataset [21] contains 1500 images, each labeled with one of 25 class labels. 
The cluttered\nimages contain objects in a variety of positions, orientations, locations, and lighting conditions.\nThe images have been oversegmented into about 30 regions (instances) each, each of which is represented\nby a 30-d feature describing its color and texture. Thus each image is a bag containing both\npositive and negative instances (segments). Labels on the training data specify whether the object of\ninterest is present or not, but the segments themselves are unlabeled (though the dataset does provide\nground truth segment labels for evaluation purposes).\n\nThe initial training set is comprised of 10 positive and 10 negative images per class, selected at\nrandom. Our active learning method must choose its queries from among 10 positive bags (complete\nsegmentations), 300 unlabeled instances (individual segments), and about 150 unlabeled bags\n(present/absent \ufb02ag on the image). We use a quadratic kernel with a coef\ufb01cient of $10^{-6}$, and average\nresults over \ufb01ve random training partitions.\n\nFigure 2(a) shows representative (best and worst) learning curves for our method and the three\nbaselines, all of which use the same MIL classi\ufb01er (NSK-SVM). Note that the curves are plotted\nagainst the cumulative cost of obtaining labels (as opposed to the number of queried instances),\nsince our algorithm may choose a sequence of queries with non-uniform cost. All methods are given\na \ufb01xed amount of manual effort (40 cost units) and are allowed to make a sequence of choices until\nthat cost is used up. Recall that a cost of 40 could correspond, for example, to obtaining labels on\n40/1 = 40 instances or 40/4 = 10 positive bags, or some mixture thereof. 
Figure 2(b) summarizes\nthe learning curves for all categories, in terms of the average improvement at a \ufb01xed point midway\nthrough the active learning phase.\n\nAll four methods steadily improve upon the initial classi\ufb01er, but at different rates with respect to the\ncost. (All methods fail to do better than chance on the \u2018dirty glove\u2019 class, which we attribute to the\nlack of distinctive texture or color on that object.) In general, a steeper learning curve indicates that\na method is learning most effectively from the supplied labels. Our multi-level approach shows the\nmost signi\ufb01cant gains at a lower cost, meaning that it is best suited for building accurate classi\ufb01ers\nwith minimal manual effort on this dataset. As we would expect, single-level active selections are\nbetter than random, but still fall short of our multi-level approach. This is because single-level active\nselection can only make a sequence of greedy choices while our approach can jointly select bags of\ninstances to query. 
Interestingly, multi- and single-level random selections perform quite similarly\non this dataset (see boxplots in (b)), which indicates that having more informative labels alone does\nnot directly lead to better classi\ufb01ers unless the right instances are queried.\n\n1 http://www.cs.wustl.edu/accio/\n2 See [28] for further implementation details, image examples, and learning curves on all classes.\n\n[Figure 2 plots omitted. (a) Area under ROC versus cost (0 to 40) for the categories ajaxorange, apple, and dirtyworkgloves, comparing multi-level active, single-level active, multi-level random, and single-level random selection. (b) Improvement in AUROC at cost = 20 for the same four methods.]\n\n(a) Example learning curves per class\n\n(b) Summary: all classes\n\nFig. 2. Results on the SIVAL dataset. (a) Sample learning curves per class, each averaged over \ufb01ve trials. First\ntwo are best examples, last is worst. (b) Summary of the average improvement over all categories after half\nof the annotation cost is used. For the same amount of annotation cost, our multi-level approach learns more\nquickly than both traditional single-level active selection as well as both forms of random selection.\n\nSIVAL dataset\n\n       Our Approach                                        MI Logistic Regression [21]\nCost   Random    Multi-level Active   Gain over Random %   Random   MIU Active   Gain over Random %\n10     +0.0051   +0.0241              372                  +0.023   +0.050       117\n20     +0.0130   +0.0360              176                  +0.033   +0.070       112\n50     +0.0274   +0.0495              81                   +0.057   +0.087       52\n\n[Figure 3 (right) plot omitted. Cumulative number of labels acquired per type (unlabeled instances, unlabeled bags, positive bags with all instances) versus timeline (0 to 10).]\n\nFig. 3. Left: Comparison with [21] on the SIVAL data, as measured by the average improvement in the AUROC\nover the initial model for increasing labeling cost values. Right: The cumulative number of labels acquired for\neach type with increasing number of queries. Our method tends to request complete segmentations or image\nlabels early on, followed by queries on unlabeled segments later on.\n\nThe table in Figure 3 compares our results to those reported in [21], in which the authors train an\ninitial classi\ufb01er with multiple-instance logistic regression, and then use the MI Uncertainty (MIU) to\nactively choose instances to label. Following [21], we report the average gains in the AUROC over\nall categories at \ufb01xed points on the learning curve, averaging results over 20 trials and with the same\ninitial training set of 20 positive and negative images. Since the accuracy of the base classi\ufb01ers used\nby the two methods varies, it is dif\ufb01cult to directly compare the gains in the AUROC. 
The NSK-SVM\nwe use consistently outperforms the logistic regression approach using only the initial training\nset; even before active learning our average accuracy is 68.84, compared to 52.21 in [21]. Therefore,\nto aid in comparison, we also report the percentage gain relative to random selection, for both\nclassifiers. The results show that our approach yields much stronger relative improvements, again\nillustrating the value of allowing active choices at multiple levels. For both methods, the percent\ngains decrease with increasing cost; this makes sense, since eventually (for enough manual effort) a\npassive learner can begin to catch up to an active learner.\n\n4.2 Actively Learning Visual Categories from Web Images\n\nNext we evaluate the scenario where each positive bag is a collection of images, among which only\na portion are actually positive instances for the class of interest. Bags are formed from the\nGoogle-downloaded images provided in [5]. This set contains on average 600 examples for each of the seven\ncategories. Naturally, the true positives for each class are sparse: on average 30% contain\na \u201cgood\u201d view of the class of interest, 20% are of \u201cok\u201d quality (occlusions, noise, cartoons, etc.), and\n50% are \u201cjunk\u201d. Previous methods have shown how to learn from noisy Web images, with results\nrivaling state-of-the-art supervised techniques [11, 5, 6]. We show how to boost accuracy with these\ntypes of learners while leveraging minimal manual annotation effort.\n\nTo re-use the publicly available dataset from [5], we randomly group Google images into bags of\nsize 25 to simulate multiple searches as in [11], yielding about 30 bags per category. We randomly\nselect 10 positive and 10 negative bags (from all other categories) to serve as the initial training data\nfor each class. The rest of the positive bags of a class are used to construct the test sets. 
All results\nare averaged over five random partitions. We represent each image as a bag of \u201cvisual words\u201d, and\ncompare examples with a linear kernel. Our method makes active queries among 10 positive bags\n(complete labels) and about 250 unlabeled instances (images). There are no unlabeled bags in this\nscenario, since every downloaded batch is associated with a keyword.\n\n[Fig. 4 panels: (a) Example learning curves per class (cars rear, guitar, motorbike): area under ROC versus cost for multi-level active, single-level active, multi-level random, and single-level random selection. (b) Summary over all classes: improvement in AUROC at cost 20 for each of the four strategies.]\n\nFig. 4. Results on the Google dataset, in the same format as Figure 2. Our multi-level active approach\noutperforms both random selection strategies and traditional single-level active selection.\n\nFigure 4 shows the learning curves and a summary of our active learner\u2019s performance. Our multi-level\napproach again shows more significant gains at a lower cost relative to all baselines, improving\naccuracy with as few as ten labeled instances. 
On this dataset, random selection with multi-level\nannotations actually outperforms random selection on single-level annotations (see the boxplots).\nWe attribute this to the distribution of bags/instances: on average more positive bags were randomly\nchosen, and each addition led to a larger increase in the AUROC.\n\n5 Conclusions and Future Work\n\nOur approach addresses a new problem: how to actively choose not only which instance to label, but\nalso what type of image annotation to acquire in a cost-effective way. Our method is general enough\nto accept other types of annotations or classifiers, as long as the cost and risk functions can be\nappropriately defined. Comparisons with passive learning methods and single-level active learning show\nthat our multi-level method is better-suited for building classifiers with minimal human intervention.\nIn future work, we will consider look-ahead scenarios with more far-sighted choices. We are also\npursuing ways to alleviate the VOI computation cost, which as implemented involves processing all\nunlabeled data prior to making a decision. Finally, we hope to incorporate our approach within an\nexisting system with many real users, like Labelme [8].\n\nReferences\n\n[1] Weber, M., Welling, M., Perona, P.: Unsupervised Learning of Models for Recognition. In: ECCV. (2000)\n[2] Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering Object Categories in Image Collections. In: ICCV. (2005)\n[3] Quelhas, P., Monay, F., Odobez, J.M., Gatica-Perez, D., Tuytelaars, T., VanGool, L.: Modeling Scenes with Local Descriptors and Latent Aspects. In: ICCV. (2005)\n[4] Bart, E., Ullman, S.: Cross-Generalization: Learning Novel Classes from a Single Example by Feature Replacement. In: CVPR. (2005)\n[5] Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning Object Categories from Google\u2019s Image Search. In: ICCV. 
(2005)\n[6] Li, L., Wang, G., Fei-Fei, L.: Optimol: Automatic Online Picture Collection via Incremental Model Learning. In: CVPR. (2007)\n[7] von Ahn, L., Dabbish, L.: Labeling Images with a Computer Game. In: CHI. (2004)\n[8] Russell, B., Torralba, A., Murphy, K., Freeman, W.: Labelme: a Database and Web-Based Tool for Image Annotation. TR, MIT (2005)\n[9] Dietterich, T., Lathrop, R., Lozano-Perez, T.: Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence 89 (1997) 31\u201371\n[10] Murphy, K., Torralba, A., Freeman, W.: Using the Forest to See the Trees: a Graphical Model Relating Features, Objects and Scenes. In: NIPS. (2003)\n[11] Vijayanarasimhan, S., Grauman, K.: Keywords to Visual Categories: Multiple-Instance Learning for Weakly Supervised Object Categorization. In: CVPR. (2008)\n[12] Maron, O., Ratan, A.: Multiple-Instance Learning for Natural Scene Classification. In: ICML. (1998)\n[13] Yang, C., Lozano-Perez, T.: Image Database Retrieval with Multiple-Instance Learning Techniques. In: ICDE. (2000)\n[14] Viola, P., Platt, J., Zhang, C.: Multiple Instance Boosting for Object Detection. In: NIPS. (2005)\n[15] Freund, Y., Seung, H., Shamir, E., Tishby, N.: Selective Sampling Using the Query by Committee Algorithm. Machine Learning 28 (1997)\n[16] Tong, S., Koller, D.: Support Vector Machine Active Learning with Applications to Text Classification. In: ICML. (2000)\n[17] Lindenbaum, M., Markovitch, S., Rusakov, D.: Selective Sampling for Nearest Neighbor Classifiers. Machine Learning 54 (2004)\n[18] Kapoor, A., Horvitz, E., Basu, S.: Selective Supervision: Guiding Supervised Learning with Decision-Theoretic Active Learning. In: IJCAI. (2007)\n[19] Kapoor, A., Grauman, K., Urtasun, R., Darrell, T.: Active Learning with Gaussian Processes for Object Categorization. In: ICCV. (2007)\n[20] Yan, R., Yang, J., Hauptmann, A.: Automatically Labeling Video Data using Multi-Class Active Learning. 
In: ICCV. (2003)\n[21] Settles, B., Craven, M., Ray, S.: Multiple-Instance Active Learning. In: NIPS. (2008)\n[22] Gartner, T., Flach, P., Kowalczyk, A., Smola, A.: Multi-Instance Kernels. In: ICML. (2002)\n[23] Bunescu, R., Mooney, R.: Multiple Instance Learning for Sparse Positive Bags. In: ICML. (2007)\n[24] Ray, S., Craven, M.: Supervised v. Multiple Instance Learning: An Empirical Comparison. In: ICML. (2005)\n[25] Cauwenberghs, G., Poggio, T.: Incremental and Decremental Support Vector Machine Learning. In: NIPS. (2000)\n[26] Andrews, S., Tsochantaridis, I., Hofmann, T.: Support Vector Machines for Multiple-Instance Learning. In: NIPS. (2002)\n[27] Platt, J.: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In: Advances in Large Margin Classifiers. MIT Press (1999)\n[28] Vijayanarasimhan, S., Grauman, K.: Multi-level Active Prediction of Useful Image Annotations for Recognition. Technical Report UT-AI-TR-08-2, University of Texas at Austin (2008)", "award": [], "sourceid": 774, "authors": [{"given_name": "Sudheendra", "family_name": "Vijayanarasimhan", "institution": null}, {"given_name": "Kristen", "family_name": "Grauman", "institution": null}]}