{"title": "A Framework for Multiple-Instance Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 570, "page_last": 576, "abstract": "", "full_text": "A Framework for Multiple-Instance Learning \n\nOded Maron \n\nNE43-755 \n\nAI Lab, M.I. T. \n\nCambridge, MA 02139 \n\noded@ai.mit.edu \n\nTomas Lozano-Perez \n\nNE43-836a \nAI Lab, M.I.T. \n\nCambridge, MA 02139 \n\ntlp@ai.mit.edu \n\nAbstract \n\nMultiple-instance learning is a variation on supervised learning, where the \ntask is to learn a concept given positive and negative bags of instances. \nEach bag may contain many instances, but a bag is labeled positive even \nif only one of the instances in it falls within the concept. A bag is labeled \nnegative only if all the instances in it are negative. We describe a new \ngeneral framework, called Diverse Density, for solving multiple-instance \nlearning problems. We apply this framework to learn a simple description \nof a person from a series of images (bags) containing that person, to a stock \nselection problem, and to the drug activity prediction problem. \n\n1 Introduction \n\nOne ofthe drawbacks of applying the supervised learning model is that it is not always possible \nfor a teacher to provide labeled examples for training. Multiple-instance learning provides a \nnew way of modeling the teacher's weakness. Instead of receiving a set of instances which \nare labeled positive or negative, the learner receives a set of bags that are labeled positive or \nnegative. Each bag contains many instances. A bag is labeled negative if all the instances in \nit are negative. On the other hand, a bag is labeled positive if there is at least one instance in it \nwhich is positive. From a collection of labeled bags, the learner tries to induce a concept that \nwill label individual instances correctly. 
This problem is harder than even noisy supervised learning, since the ratio of negative to positive instances in a positively-labeled bag (the noise ratio) can be arbitrarily high. \n\nThe first application of multiple-instance learning was to drug activity prediction. In the activity prediction application, one objective is to predict whether a candidate drug molecule will bind strongly to a target protein known to be involved in some disease state. Typically, one has examples of molecules that bind well to the target protein and also of molecules that do not bind well. Much as in a lock and key, shape is the most important factor in determining whether a drug molecule and the target protein will bind. However, drug molecules are flexible, so they can adopt a wide range of shapes. A positive example does not convey what shape the molecule took in order to bind - only that one of the shapes that the molecule can take was the right one. However, a negative example means that none of the shapes that the molecule can achieve was the right key. \n\nThe multiple-instance learning model was only recently formalized by [Dietterich et al., 1997]. They assume a hypothesis class of axis-parallel rectangles, and develop algorithms for dealing with the drug activity prediction problem described above. This work was followed by [Long and Tan, 1996], where a high-degree polynomial PAC bound was given for the number of examples needed to learn in the multiple-instance learning model. [Auer, 1997] gives a more efficient algorithm, and [Blum and Kalai, 1998] shows that learning from multiple-instance examples is reducible to PAC-learning with two-sided noise and to the Statistical Query model. Unfortunately, the last three papers make the restrictive assumption that all instances from all bags are generated independently. 
\n\nIn this paper, we describe a framework called Diverse Density for solving multiple-instance problems. Diverse Density is a measure of the intersection of the positive bags minus the union of the negative bags. By maximizing Diverse Density we can find the point of intersection (the desired concept), and also the set of feature weights that lead to the best intersection. We show results of applying this algorithm to a difficult synthetic training set as well as the "musk" data set from [Dietterich et al., 1997]. We then use Diverse Density in two novel applications: one is to learn a simple description of a person from a series of images that are labeled positive if the person is somewhere in the image and negative otherwise. The other is to deal with a high amount of noise in a stock selection problem. \n\n2 Diverse Density \n\nWe motivate the idea of Diverse Density through a molecular example. Suppose that the shape of a candidate molecule can be adequately described by a feature vector. One instance of the molecule is therefore represented as a point in n-dimensional feature space. As the molecule changes its shape (through both rigid and non-rigid transformations), it will trace out a manifold through this n-dimensional space^1. Figure 1(a) shows the paths of four molecules through a 2-dimensional feature space. \n\nIf a candidate molecule is labeled positive, we know that in at least one place along the manifold, it took on the right shape for it to fit into the target protein. If the molecule is labeled negative, we know that none of the conformations along its manifold will allow binding with the target protein. If we assume that there is only one shape that will bind to the target protein, what do the positive and negative manifolds tell us about the location of the correct shape in feature space? The answer: it is where all positive feature-manifolds intersect without intersecting any negative feature-manifolds. 
For example, in Figure 1(a) it is point A. \n\nUnfortunately, a multiple-instance bag does not give us complete distribution information, but only some arbitrary sample from that distribution. In fact, in applications other than drug discovery, there is not even a notion of an underlying continuous manifold. Therefore, Figure 1(a) becomes Figure 1(b). The problem of trying to find an intersection changes \n\n^1 In practice, one needs to restrict consideration to shapes of the molecule that have sufficiently low potential energy. But, we ignore this restriction in this simple illustration. \n\n(a) The different shapes that a molecule can take on are represented as a path. The intersection point of positive paths is where they took on the same shape. (b) Samples taken along the paths. Section B is a high density area, but point A is a high Diverse Density area. \n\nFigure 1: A motivating example for Diverse Density \n\nto a problem of trying to find an area where there is both high density of positive points and low density of negative points. The difficulty with using regular density is illustrated in Figure 1(b), Section B. We are not just looking for high density, but high "Diverse Density". We define Diverse Density at a point to be a measure of how many different positive bags have instances near that point, and how far the negative instances are from that point. \n\n2.1 Algorithms for multiple-instance learning \n\nIn this section, we derive a probabilistic measure of Diverse Density, and test it on a difficult artificial data set. 
We denote positive bags as $B_i^+$, the $j$th point in that bag as $B_{ij}^+$, and the value of the $k$th feature of that point as $B_{ijk}^+$. Likewise, $B_{ij}^-$ represents a negative point. Assuming for now that the true concept is a single point $t$, we can find it by maximizing $\Pr(x = t \mid B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^-)$ over all points $x$ in feature space. If we use Bayes' rule and an uninformative prior over the concept location, this is equivalent to maximizing the likelihood $\Pr(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^- \mid x = t)$. By making the additional assumption that the bags are conditionally independent given the target concept $t$, the best hypothesis is $\arg\max_x \prod_i \Pr(B_i^+ \mid x = t) \prod_i \Pr(B_i^- \mid x = t)$. Using Bayes' rule once more (and again assuming a uniform prior over concept location), this is equivalent to \n\n$\arg\max_x \prod_i \Pr(x = t \mid B_i^+) \prod_i \Pr(x = t \mid B_i^-)$   (1) \n\nThis is a general definition of maximum Diverse Density, but we need to define the terms in the products to instantiate it. One possibility is a noisy-or model: the probability that not all points missed the target is $\Pr(x = t \mid B_i^+) = \Pr(x = t \mid B_{i1}^+, B_{i2}^+, \ldots) = 1 - \prod_j (1 - \Pr(x = t \mid B_{ij}^+))$, and likewise $\Pr(x = t \mid B_i^-) = \prod_j (1 - \Pr(x = t \mid B_{ij}^-))$. We model the causal probability of an individual instance on a potential target as related to the distance between them. Namely, $\Pr(x = t \mid B_{ij}) = \exp(-\|B_{ij} - x\|^2)$. Intuitively, if one of the instances in a positive bag is close to $x$, then $\Pr(x = t \mid B_i^+)$ is high. Likewise, if every positive bag has an instance close to $x$ and no negative bags are close to $x$, then $x$ will have high Diverse Density. Diverse Density at an intersection of $n$ bags is exponentially higher than it is at an intersection of $n - 1$ bags, yet all it takes is one well-placed negative instance to drive the Diverse Density down. 
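The noisy-or Diverse Density measure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it works in log space for numerical stability, and the small epsilon guarding the logarithms is our own addition. \n\n```python\nimport numpy as np\n\ndef log_diverse_density(x, positive_bags, negative_bags):\n    """Log of the noisy-or Diverse Density at a candidate point x.\n\n    positive_bags / negative_bags: lists of (n_instances, n_features)\n    arrays. Each instance contributes Pr(x = t | B_ij) = exp(-||B_ij - x||^2),\n    combined by a noisy-or within each bag and a product across bags.\n    """\n    log_dd = 0.0\n    for bag in positive_bags:\n        # Pr(x = t | B_i+) = 1 - prod_j (1 - exp(-||B_ij - x||^2))\n        p_inst = np.exp(-np.sum((bag - x) ** 2, axis=1))\n        log_dd += np.log(1.0 - np.prod(1.0 - p_inst) + 1e-300)\n    for bag in negative_bags:\n        # Pr(x = t | B_i-) = prod_j (1 - exp(-||B_ij - x||^2))\n        p_inst = np.exp(-np.sum((bag - x) ** 2, axis=1))\n        log_dd += np.sum(np.log(1.0 - p_inst + 1e-300))\n    return log_dd\n```\n\nA point that is close to an instance from every positive bag and far from all negative instances scores highest, matching the intuition above: one nearby negative instance is enough to drive the value down sharply. 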
\n\nFigure 2: Negative and positive bags drawn from the same distribution, but labeled according to their intersection with the middle square. Negative instances are dots, positive are numbers. The square contains at least one instance from every positive bag and no negatives. \n\nThe Euclidean distance metric used to measure "closeness" depends on the features that describe the instances. It is likely that some of the features are irrelevant, or that some should be weighted to be more important than others. Luckily, we can use the same framework to find not only the best location in feature space, but also the best weighting of the features. Once again, we find the best scaling of the individual features by finding the scalings that maximize Diverse Density. The algorithm returns both a location $x$ and a scaling vector $s$, where $\|B_{ij} - x\|^2 = \sum_k s_k^2 (B_{ijk} - x_k)^2$. \n\nNote that the assumption that all bags intersect at a single point is not necessary. We can assume more complicated concepts, such as for example a disjunctive concept $t_a \vee t_b$. In this case, we maximize over a pair of locations $x_a$ and $x_b$ and define $\Pr(x_a = t_a \vee x_b = t_b \mid B_{ij}) = \max(\Pr(x_a = t_a \mid B_{ij}), \Pr(x_b = t_b \mid B_{ij}))$. \n\nTo test the algorithm, we created an artificial data set: 5 positive and 5 negative bags, each with 50 instances. Each instance was chosen uniformly at random from a $[0, 100] \times [0, 100] \subset \mathbb{R}^2$ domain, and the concept was a $5 \times 5$ square in the middle of the domain. A bag was labeled positive if at least one of its instances fell within the square, and negative if none did, as shown in Figure 2. The square in the middle contains at least one instance from every positive bag and no negative instances. This is a difficult data set because both positive and negative bags are drawn from the same distribution. They only differ in a small area of the domain. 
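The artificial data set just described can be generated with a short sketch. The paper does not say how bags with the "wrong" label were handled during generation, so the rejection-sampling loop below (draw a bag, label it by its intersection with the square, keep it only if that label is still needed) is our assumption; the fixed seed and helper names are illustrative. \n\n```python\nimport numpy as np\n\nrng = np.random.default_rng(0)\nLO, HI = 47.5, 52.5  # a 5 x 5 concept square centered in [0, 100] x [0, 100]\n\ndef in_square(bag):\n    """True iff at least one instance of the bag falls in the concept square."""\n    return bool(np.any((bag[:, 0] >= LO) & (bag[:, 0] <= HI) &\n                       (bag[:, 1] >= LO) & (bag[:, 1] <= HI)))\n\ndef sample_bag():\n    # 50 instances drawn uniformly from the [0, 100] x [0, 100] domain\n    return rng.uniform(0.0, 100.0, size=(50, 2))\n\ndef make_bags(n_pos=5, n_neg=5):\n    """Draw bags from the same uniform distribution, keeping each\n    according to its label until the requested counts are reached."""\n    pos, neg = [], []\n    while len(pos) < n_pos or len(neg) < n_neg:\n        bag = sample_bag()\n        if in_square(bag):\n            if len(pos) < n_pos:\n                pos.append(bag)\n        elif len(neg) < n_neg:\n            neg.append(bag)\n    return pos, neg\n\npos_bags, neg_bags = make_bags()\n```\n\nBecause positive and negative bags come from the identical distribution, only the small concept square separates them, which is what makes this data set hard for density-based methods. 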
\n\nUsing regular density (adding up the contribution of every positive bag and subtracting negative bags; this is roughly what a supervised learning algorithm such as nearest neighbor performs), we can plot the density surface across the domain. Figure 3(a) shows this surface for the data set in Figure 2, and it is clear that finding the peak (a candidate hypothesis) is difficult. However, when we plot the Diverse Density surface (using the noisy-or model) in Figure 3(b), it is easy to pick out the global maximum, which is within the desired concept. \n\n(a) Surface using regular density. (b) Surface using Diverse Density. \n\nFigure 3: Density surfaces over the example data of Figure 2 \n\nThe other major peaks in Figure 3(b) are the result of a chance concentration of instances from different bags. With a bit more bad luck, one of those peaks could have eclipsed the one in the middle. However, the chance of this decreases as the number of bags (training examples) increases. \n\nOne remaining issue is how to find the maximum Diverse Density. In general, we are searching an arbitrary density landscape and the number of local maxima and size of the search space could prohibit any efficient exploration. In this paper, we use gradient ascent with multiple starting points. This has worked successfully in every test case because we know what starting points to use. The maximum Diverse Density peak is made of contributions from some set of positive points. If we start an ascent from every positive point, one of them is likely to be closest to the maximum, contribute the most to it, and have a climb directly to it. While this heuristic is sensible for maximizing with respect to location, maximizing with respect to scaling of feature weights may still lead to local maxima. 
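The multi-start ascent heuristic can be sketched as follows. This is a simplified stand-in for the paper's optimizer: it minimizes the negative log Diverse Density with plain finite-difference gradient steps rather than an analytic gradient, and the step size, iteration count, and epsilon are our own illustrative choices. \n\n```python\nimport numpy as np\n\ndef neg_log_dd(x, pos_bags, neg_bags):\n    # Negative log of the noisy-or Diverse Density at x (to be minimized).\n    val = 0.0\n    for bag in pos_bags:\n        p = np.exp(-np.sum((bag - x) ** 2, axis=1))\n        val -= np.log(1.0 - np.prod(1.0 - p) + 1e-300)\n    for bag in neg_bags:\n        p = np.exp(-np.sum((bag - x) ** 2, axis=1))\n        val -= np.sum(np.log(1.0 - p + 1e-300))\n    return val\n\ndef maximize_dd(pos_bags, neg_bags, steps=200, lr=0.05, eps=1e-5):\n    """Multi-start ascent: one gradient climb from every instance of\n    every positive bag; return the endpoint with the highest DD."""\n    best_x, best_val = None, np.inf\n    for x0 in np.vstack(pos_bags):\n        x = x0.astype(float).copy()\n        for _ in range(steps):\n            g = np.zeros_like(x)\n            for k in range(x.size):  # central finite-difference gradient\n                d = np.zeros_like(x)\n                d[k] = eps\n                g[k] = (neg_log_dd(x + d, pos_bags, neg_bags)\n                        - neg_log_dd(x - d, pos_bags, neg_bags)) / (2 * eps)\n            x -= lr * g  # descend on -log DD, i.e. ascend on DD\n        val = neg_log_dd(x, pos_bags, neg_bags)\n        if val < best_val:\n            best_x, best_val = x, val\n    return best_x\n```\n\nStarting one climb from every positive instance mirrors the heuristic in the text: the instance nearest the true peak contributes most to it and tends to climb straight into it, while climbs started elsewhere end at lower local maxima and are discarded. 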
\n\n3 Applications of Diverse Density \n\nBy way of benchmarking, we tested the Diverse Density approach on the "musk" data sets from [Dietterich et al., 1997], which were also used in [Auer, 1997]. We also have begun investigating two new applications of multiple-instance learning. We describe preliminary results on all of these below. The musk data sets contain feature vectors describing the surfaces of a variety of low-energy shapes from approximately 100 molecules. Each feature vector has 166 dimensions. Approximately half of these molecules are known to smell "musky," the remainder are very similar molecules that do not smell musky. There are two musk data sets; the Musk-1 data set is smaller, both in having fewer molecules and many fewer instances per molecule. Many (72) of the molecules are shared between the two data sets, but the second set includes more instances for the shared molecules. \n\nWe approached the problem as follows: for each run, we held out a randomly selected 1/10 of the data set as a test set. We computed the maximum Diverse Density on the training set by multiple gradient ascents, starting at each positive instance. This produces a maximum feature point as well as the best feature weights corresponding to that point. We note that typically less than half of the 166 features receive non-zero weighting. We then computed a distance threshold that optimized classification performance under leave-one-out cross validation within the training set. We used the feature weights and distance threshold to classify the examples of the test set; an example was deemed positive if the weighted distance from the maximum density point to any of its instances was below the threshold. 
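The test-time decision rule just described is simple enough to state directly in code. A minimal sketch, assuming the max-DD point, feature scalings, and threshold have already been found; the names `x_max`, `scales`, and `threshold` are illustrative, not from the paper. \n\n```python\nimport numpy as np\n\ndef classify_bag(bag, x_max, scales, threshold):\n    """Label a bag positive iff the weighted distance from the\n    max-Diverse-Density point to its nearest instance is below the\n    threshold chosen by leave-one-out cross validation.\n\n    bag: (n_instances, n_features) array; x_max, scales: (n_features,).\n    """\n    # weighted squared distance: sum_k s_k^2 * (B_jk - x_k)^2\n    d2 = np.sum((scales ** 2) * (bag - x_max) ** 2, axis=1)\n    return bool(np.min(d2) < threshold ** 2)\n```\n\nOnly the nearest instance matters, consistent with the bag semantics: one instance close (in weighted feature space) to the learned concept point is enough to call the whole bag positive. 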
\n\nThe table below lists the average accuracy of twenty runs, compared with the performance of the two principal algorithms reported in [Dietterich et al., 1997] (iterated-discrim APR and GFS elim-kde APR), as well as the MULTINST algorithm from [Auer, 1997]. We note that the performance reported for iterated-discrim APR involves choosing parameters to maximize test set performance, and so probably represents an upper bound for accuracy on this data set. The MULTINST algorithm assumes that all instances from all bags are generated independently. The Diverse Density results, which required no tuning, are comparable to or better than those of GFS elim-kde APR and MULTINST. \n\nMusk Data Set 1 \nalgorithm              accuracy \niterated-discrim APR   92.4 \nGFS elim-kde APR       91.3 \nDiverse Density        88.9 \nMULTINST               76.7 \n\nMusk Data Set 2 \nalgorithm              accuracy \niterated-discrim APR   89.2 \nMULTINST               84.0 \nDiverse Density        82.5 \nGFS elim-kde APR       80.4 \n\nWe also investigated two new applications of multiple-instance learning. The first of these is to learn a simple description of a person from a series of images that are labeled positive if they contain the person and negative otherwise. For a positively labeled image we only know that the person is somewhere in it, but we do not know where. We sample 54 subimages of varying centers and sizes and declare them to be instances in one positive bag, since one of them contains the person. This is repeated for every positive and negative image. \n\nWe use a very simple representation for the instances. Each subimage is divided into three parts which roughly correspond to where the head, torso and legs of the person would be. The three dominant colors (one for each subsection) are used to represent the image. Figure 4 shows a training set where every bag included two people, yet the algorithm learned a description of the person who appears in all the images. 
This technique is expanded in [Maron and LakshmiRatan, 1998] to learn descriptions of natural images and use the learned concept to retrieve similar images from a large image database. \n\nAnother new application uses Diverse Density in the stock selection problem. Every month, there are stocks that perform well for fundamental reasons and stocks that perform well because of flukes; there are many more of the latter, but we are interested in the former. For every month, we take the 100 stocks with the highest return and put them in a positive bag, hoping that at least one of them did well for fundamental reasons. Negative bags are created from the bottom 5 stocks in every month. A stock instance is described by 17 features such as momentum, price to fair-value, etc. Grantham, Mayo, Van Otterloo & Co. kindly provided us with data on the 600 largest US stocks since 1978. We tested the algorithm through five runs of training for ten years, then testing on the next year. In each run, the algorithm returned the stock description (location in feature space and a scaling of the features) that maximized Diverse Density. The test stocks were then ranked and divided into deciles by distance (in weighted feature space) to the max-DD point. Figure 5 shows the average return of every decile. The return in the top decile (stocks that are most like the "fundamental stock") is positive and higher than the average return of a GMO predictor. Likewise, the return in the bottom decile is negative and below that of a GMO predictor. \n\nFigure 5: Black bars show Diverse Density's average return on a decile, and the white bars show GMO's predictor's return. 
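The monthly bag construction for the stock application is mechanical and can be sketched directly. The function name and array layout are illustrative; the top-100/bottom-5 split follows the setup described above. \n\n```python\nimport numpy as np\n\ndef monthly_bags(features, returns):\n    """Build multiple-instance bags for one month of stock data.\n\n    features: (n_stocks, 17) array of per-stock descriptors\n              (momentum, price to fair-value, etc.).\n    returns:  (n_stocks,) array of that month's returns.\n    The 100 best performers form one positive bag (hopefully at\n    least one did well for fundamental reasons); the 5 worst form\n    one negative bag.\n    """\n    order = np.argsort(returns)            # ascending by return\n    positive_bag = features[order[-100:]]  # 100 highest returns\n    negative_bag = features[order[:5]]     # 5 lowest returns\n    return positive_bag, negative_bag\n```\n\nThe noise ratio in the positive bags is deliberately high: most of the 100 top performers are flukes, which is precisely the regime where the multiple-instance formulation (only one instance needs to be positive) is appropriate. 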
\n\n4 Conclusion \n\nIn this paper, we have shown that Diverse Density is a general tool with which to learn from Multiple-Instance examples. In addition, we have shown that Multiple-Instance problems occur in a wide variety of domains. We attempted to show the various ways in which ambiguity can lead to the Multiple-Instance framework: through lack of knowledge in the drug discovery example, through ambiguity of representation in the vision example, and through a high degree of noise in the stock example. \n\nAcknowledgements \n\nWe thank Peter Dayan and Paul Viola at MIT and Tom Hancock and Chris Darnell at GMO for helpful discussions, and the AFOSR ASSERT program, Parent Grant #F49620-93-1-0263, for their support of this research. \n\nReferences \n\n[Auer, 1997] P. Auer. On Learning from Multi-Instance Examples: Empirical Evaluation of a Theoretical Approach. NeuroCOLT Technical Report Series, NC-TR-97-025, March 1997. \n\n[Blum and Kalai, 1998] A. Blum and A. Kalai. A Note on Learning from Multiple-Instance Examples. To appear in Machine Learning, 1998. \n\n[Dietterich et al., 1997] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the Multiple-Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence Journal, 89, 1997. \n\n[Long and Tan, 1996] P. M. Long and L. Tan. PAC-learning axis-aligned rectangles with respect to product distributions from multiple-instance examples. In Proceedings of the 1996 Conference on Computational Learning Theory, 1996. \n\n[Maron and LakshmiRatan, 1998] O. Maron and A. LakshmiRatan. Multiple-Instance Learning for Natural Scene Classification. In Submitted to CVPR-98, 1998. \n", "award": [], "sourceid": 1346, "authors": [{"given_name": "Oded", "family_name": "Maron", "institution": null}, {"given_name": "Tom\u00e1s", "family_name": "Lozano-P\u00e9rez", "institution": null}]}