{"title": "Support Vector Machines for Multiple-Instance Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 584, "abstract": null, "full_text": "Support Vector Machines for \nMulti ple-Instance Learning \n\nStuart Andrews, Ioannis Tsochantaridis and Thomas Hofmann \nDepartment of Computer Science, Brown University, Providence, RI 02912 \n\n{stu,it,th}@cs.brown.edu \n\nAbstract \n\nThis paper presents two new formulations of multiple-instance \nlearning as a maximum margin problem. The proposed extensions \nof the Support Vector Machine (SVM) learning approach lead to \nmixed integer quadratic programs that can be solved heuristically. \nOur generalization of SVMs makes a state-of-the-art classification \ntechnique, including non-linear classification via kernels, available \nto an area that up to now has been largely dominated by special \npurpose methods. We present experimental results on a pharma(cid:173)\nceutical data set and on applications in automated image indexing \nand document categorization. \n\n1 \n\nIntroduction \n\nMultiple-instance learning (MIL) [4] is a generalization of supervised classification \nin which training class labels are associated with sets of patterns, or bags, instead of \nindividual patterns. While every pattern may possess an associated true label, it is \nassumed that pattern labels are only indirectly accessible through labels attached to \nbags. The law of inheritance is such that a set receives a particular label, if at least \none of the patterns in the set possesses the label. In the important case of binary \nclassification, this implies that a bag is \"positive\" if at least one of its member \npatterns is a positive example. MIL differs from the general set-learning problem in \nthat the set-level classifier is by design induced by a pattern-level classifier. 
Hence \nthe key challenge in MIL is to cope with the ambiguity of not knowing which of the \npatterns in a positive bag are the actual positive examples and which ones are not. \n\nThe MIL setting has numerous interesting applications. One prominent application \nis the classification of molecules in the context of drug design [4]. Here, \neach molecule is represented by a bag of possible conformations. The efficacy of \na molecule can be tested experimentally, but there is no way to control for individual \nconformations. A second application is image indexing for content-based \nimage retrieval. Here, an image can be viewed as a bag of local image patches [9] \nor image regions. Since annotating whole images is far less time-consuming than \nmarking relevant image regions, the ability to deal with this type of weakly annotated \ndata is very desirable. Finally, consider the problem of text categorization, to \nwhich we are the first to apply the MIL setting. Usually, documents which contain \na relevant passage are considered to be relevant with respect to a particular category \nor topic, yet class labels are rarely available at the passage level and are most \ncommonly associated with the document as a whole. Formally, all of the above \napplications share the same type of label ambiguity, which in our opinion makes a \nstrong argument in favor of the relevance of the MIL setting. \n\nWe present two approaches to modify and extend Support Vector Machines (SVMs) \nto deal with MIL problems. The first approach explicitly treats the pattern labels \nas unobserved integer variables, subject to constraints defined by the (positive) \nbag labels. The goal then is to maximize the usual pattern margin, or soft margin, \njointly over the hidden label variables and a linear (or kernelized) discriminant \nfunction. The second approach generalizes the notion of a margin to bags and aims at \nmaximizing the bag margin directly. 
The latter seems most appropriate in cases \nwhere we mainly care about classifying new test bags, while the first approach \nseems preferable whenever the goal is to derive an accurate pattern-level classifier. \nIn the case of singleton bags, both methods are identical and reduce to the standard \nsoft-margin SVM formulation. \n\nAlgorithms for the MIL problem were first presented in [4, 1, 7]. These methods (and \nrelated analytical results) are based on hypothesis classes consisting of axis-aligned \nrectangles. Similarly, methods developed subsequently (e.g., [8, 12]) have focused \non specially tailored machine learning algorithms that do not compare favorably in \nthe limiting case of the standard classification setting. A notable exception is [10]. \nMore recently, a kernel-based approach has been suggested which derives MI-kernels \non bags from a given kernel defined on the pattern level [5]. While the MI-kernel \napproach treats the MIL problem merely as a representational problem, we strongly \nbelieve that a deeper conceptual modification of SVMs, as outlined in this paper, is \nnecessary. However, we share the ultimate goal with [5], which is to make state-of-the-art \nkernel-based classification methods available for multiple-instance learning. \n\n2 Multiple-Instance Learning \n\nIn statistical pattern recognition, it is usually assumed that a training set of labeled \npatterns is available, where each pair (x_i, y_i) ∈ R^d × Y has been generated \nindependently from an unknown distribution. The goal is to induce a classifier, i.e., \na function from patterns to labels f : R^d → Y. In this paper, we will focus on \nthe binary case Y = {-1, 1}. Multiple-instance learning (MIL) generalizes this \nproblem by making significantly weaker assumptions about the labeling information. \nPatterns are grouped into bags, and a label is attached to each bag rather than \nto every pattern. More formally, given is a set of input patterns x_1, ... 
, x_n grouped \ninto bags B_1, ..., B_m, with B_I = {x_i : i ∈ I} for given index sets I ⊆ {1, ..., n} (typically \nnon-overlapping). With each bag B_I is associated a label Y_I. These labels \nare interpreted in the following way: if Y_I = -1, then y_i = -1 for all i ∈ I, i.e., no \npattern in the bag is a positive example. If on the other hand Y_I = 1, then at least \none pattern x_i ∈ B_I is a positive example of the underlying concept. Notice that \nthe information provided by the label is asymmetric in the sense that a negative \nbag label induces a unique label for every pattern in the bag, while a positive label \ndoes not. In general, the relation between pattern labels y_i and bag labels Y_I can be \nexpressed compactly as Y_I = max_{i ∈ I} y_i, or alternatively as the set of linear constraints \n\nsum_{i ∈ I} (y_i + 1)/2 >= 1 for all I s.t. Y_I = 1, and y_i = -1 for all i ∈ I s.t. Y_I = -1.   (1) \n\nFinally, let us call a discriminant function f : X → R MI-separating with respect to \na multiple-instance data set if sgn max_{i ∈ I} f(x_i) = Y_I holds for all bags B_I. \n
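The bag-label semantics just defined (Y_I = max_{i ∈ I} y_i, and MI-separation via the sign of the bag-wise maximum of the discriminant) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper; the function names, the toy bags, and the threshold discriminant f(x) = x - 2 are all hypothetical choices made for the example.

```python
# Illustrative sketch of the MIL label semantics (Eq. (1)) and the
# MI-separating condition; not part of the original paper.

def bag_label(pattern_labels):
    """Y_I = max_{i in I} y_i: a bag is positive iff at least one
    of its (hidden) pattern labels y_i in {-1, +1} is positive."""
    return max(pattern_labels)

def is_mi_separating(f, bags, bag_labels):
    """Check sgn(max_{i in I} f(x_i)) == Y_I for every bag B_I."""
    for patterns, Y in zip(bags, bag_labels):
        score = max(f(x) for x in patterns)  # bag-level score
        if (1 if score > 0 else -1) != Y:
            return False
    return True

# Toy 1-d example with a hypothetical threshold discriminant f(x) = x - 2:
bags = [[1.0, 3.0], [0.5, 1.5]]                       # two bags of patterns
labels = [bag_label([-1, 1]), bag_label([-1, -1])]    # -> [1, -1]
f = lambda x: x - 2.0
print(labels, is_mi_separating(f, bags, labels))      # -> [1, -1] True
```

The asymmetry noted above is visible here: the negative bag fixes every pattern label, while the positive bag only guarantees that some pattern scores positively.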