{"title": "EM-DD: An Improved Multiple-Instance Learning Technique", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1080, "abstract": null, "full_text": "EM-DD: An Improved Multiple-Instance Learning Technique\n\nQi Zhang\nDepartment of Computer Science\nWashington University\nSt. Louis, MO 63130-4899\nqz@cs.wustl.edu\n\nSally A. Goldman\nDepartment of Computer Science\nWashington University\nSt. Louis, MO 63130-4899\nsg@cs.wustl.edu\n\nAbstract\n\nWe present a new multiple-instance (MI) learning technique (EM-DD) that combines EM with the diverse density (DD) algorithm. EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. Furthermore, EM-DD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag.\n\n1  Introduction\nThe multiple-instance (MI) learning model has received much attention. In this model, each training example is a set (or bag) of instances along with a single label equal to the maximum label among all instances in the bag. The individual instances within the bag are not given labels. The goal is to learn to accurately predict the label of previously unseen bags. Standard supervised learning can be viewed as a special case of MI learning where each bag holds a single instance.  
The MI learning model was originally motivated by the drug activity prediction problem, where each instance is a possible conformation (or shape) of a molecule and each bag contains all likely low-energy conformations for the molecule. A molecule is active if it binds strongly to the target protein in at least one of its conformations and is inactive if no conformation binds to the protein. The problem is to predict the label (active or inactive) of molecules based on their conformations.\n\nThe MI learning model was first formalized by Dietterich et al. in their seminal paper [4], in which they developed MI algorithms for learning axis-parallel rectangles (APRs) and provided two benchmark \"Musk\" data sets. Following this work, there has been a significant amount of research directed towards the development of MI algorithms using different learning models [2,5,6,9,12]. Maron and Ratan [7] applied the multiple-instance model to the task of recognizing a person from a series of images that are labeled positive if they contain the person and negative otherwise. The same technique was used to learn descriptions of natural scene images (such as a waterfall) and to retrieve similar images from a large image database using the learned concept [7]. More recently, Ruffo [11] has used this model for data mining applications.\n\nWhile the musk data sets have boolean labels, algorithms that can handle real-value labels are often desirable in real-world applications. For example, the binding affinity between a molecule and receptor is quantitative, and hence a real-value classification of binding strength is preferable to a binary one.  
Most prior research on MI learning is restricted to concept learning (i.e., boolean labels). Recently, MI learning with real-value labels has been performed using extensions of the diverse density (DD) and k-NN algorithms [1] and using MI regression [10].\n\nIn this paper, we present a general-purpose MI learning technique (EM-DD) that combines EM [3] with the extended DD [1] algorithm. The algorithm is applied to both boolean and real-value labeled data and the results are compared with corresponding MI learning algorithms from previous work. In addition, the effects of the number of instances per bag and the number of relevant features on the performance of the EM-DD algorithm are evaluated using artificial data sets. A second contribution of this work is a new general framework for MI learning that converts the MI problem to a single-instance setting using EM. A very similar approach was also used by Ray and Page [10].\n\n2  Background\nDietterich et al. [4] presented three algorithms for learning APRs in the MI model. Their best performing algorithm (iterated-discrim) starts with a point in the feature space and \"grows\" a box with the goal of finding the smallest box that covers at least one instance from each positive bag and no instances from any negative bag. The resulting box was then expanded (via a statistical technique) to get better results. However, the test data from Musk1 was used to tune the parameters of the algorithm. These parameters were then used for both Musk1 and Musk2. 
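
As a small illustration of the covering criterion described above (a box that contains at least one instance from each positive bag and no instance from any negative bag), the following Python sketch checks whether a given axis-parallel rectangle is consistent with a set of labeled bags. It is not the iterated-discrim algorithm of [4]; the function names and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Instance = Sequence[float]
Bag = List[Instance]


def in_box(x: Instance, lower: Sequence[float], upper: Sequence[float]) -> bool:
    # True if point x lies inside the axis-parallel rectangle [lower, upper].
    return all(lo <= v <= hi for v, lo, hi in zip(x, lower, upper))


def apr_is_consistent(bags: List[Bag], labels: List[bool],
                      lower: Sequence[float], upper: Sequence[float]) -> bool:
    # A positive bag must have at least one instance inside the box;
    # a negative bag must have none.
    for bag, positive in zip(bags, labels):
        covered = any(in_box(x, lower, upper) for x in bag)
        if covered != positive:
            return False
    return True


# Hypothetical toy data: two positive bags and one negative bag in 2-D.
bags = [
    [(0.1, 0.2), (0.9, 0.9)],  # positive: first instance is inside the box
    [(0.2, 0.1), (0.5, 0.8)],  # positive: first instance is inside the box
    [(0.7, 0.7), (0.6, 0.9)],  # negative: no instance is inside the box
]
labels = [True, True, False]
consistent = apr_is_consistent(bags, labels, lower=(0.0, 0.0), upper=(0.3, 0.3))
```

Growing the box to cover the whole unit square would also cover the negative bag, so the same check would then fail; an APR learner has to trade off these two requirements.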
\n\nAuer [2] presented an algorithm, MULTINST, that learns using simple statistics to find the halfspaces defining the boundaries of the target APR and hence avoids some potentially hard computational problems that were required by the heuristics used in the iterated-discrim algorithm. More recently, Wang and Zucker [12] proposed a lazy learning approach by applying two variants of the k nearest neighbor algorithm (k-NN), which they refer to as citation-kNN and Bayesian k-NN. Ramon and De Raedt [9] developed an MI neural network algorithm.\n\nOur work builds heavily upon the Diverse Density (DD) algorithm of Maron and Lozano-Perez [5,6]. When describing the shape of a molecule by $n$ features, one can view each conformation of the molecule as a point in an $n$-dimensional feature space. The diverse density at a point $p$ in the feature space is a probabilistic measure of both how many different positive bags have an instance near $p$, and how far the negative instances are from $p$. Intuitively, the diverse density of a hypothesis $h$ is just the likelihood (with respect to the data) that $h$ is the target. A high diverse density indicates a good candidate for a \"true\" concept.\n\nWe now formally define the general MI problem (with boolean or real-value labels) and the DD likelihood measurement originally defined in [6] and extended to real-value labels in [1]. Let $D$ be the labeled data, which consists of a set of $m$ bags $B = \\{B_1, \\ldots, B_m\\}$ and labels $L = \\{\\ell_1, \\ldots, \\ell_m\\}$, i.e., $D = \\{\\langle B_1, \\ell_1 \\rangle, \\ldots, \\langle B_m, \\ell_m \\rangle\\}$. Let bag $B_i = \\{B_{i1}, \\ldots, B_{ij}, \\ldots, B_{in}\\}$, where $B_{ij}$ denotes the $j$th instance in bag $i$. Assume the labels of the instances in $B_i$ are $\\ell_{i1}, \\ldots, \\ell_{ij}, \\ldots, \\ell_{in}$. For boolean labels, $\\ell_i = \\ell_{i1} \\vee \\ell_{i2} \\vee \\cdots \\vee \\ell_{in}$, and for real-value labels, $\\ell_i = \\max\\{\\ell_{i1}, \\ell_{i2}, \\ldots, \\ell_{in}\\}$. The diverse density of a hypothesized target point $h$ is defined as\n\n$$DD(h) = \\Pr(h \\mid D) = \\frac{\\Pr(D \\mid h)\\,\\Pr(h)}{\\Pr(D)} = \\frac{\\Pr(B, L \\mid h)\\,\\Pr(h)}{\\Pr(B, L)}.$$\n\nAssuming a uniform prior on the hypothesis space and independence of the $\\langle B_i, \\ell_i \\rangle$ pairs given $h$, using Bayes' rule, the maximum likelihood hypothesis, $h_{DD}$, is defined as:\n\n$$h_{DD} = \\arg\\max_{h \\in H} \\Pr(D \\mid h) = \\arg\\max_{h \\in H} \\prod_{i=1}^{m} \\Pr(B_i, \\ell_i \\mid h) = \\arg\\min_{h \\in H} \\sum_{i=1}^{m} \\left(-\\log \\Pr(\\ell_i \\mid h, B_i)\\right).$$\n\nAs in the extended DD algorithm [1], $\\Pr(\\ell_i \\mid h, B_i)$ is estimated as $1 - |\\ell_i - \\mathrm{Label}(B_i \\mid h)|$, where $\\mathrm{Label}(B_i \\mid h)$ is the label that would be given to $B_i$ if $h$ were the correct hypothesis. When the labels are boolean (0 or 1), this formulation is exactly the most-likely-cause estimator used in the original DD algorithm [5]. For most applications the influence each feature has on the label varies greatly. This variation is modeled in the DD algorithm by associating with each attribute an (unknown) scale factor. Hence the target concept really consists of two values per dimension, the ideal attribute value and the scale value. Using the assumption that binding strength drops exponentially as the similarity between the conformation and the ideal shape decreases, the following generative model was introduced by Maron and Lozano-Perez [6] for estimating the label of bag $B_i$ for hypothesis $h = \\{h_1, \\ldots, h_n, s_1, \\ldots, s_n\\}$:\n\n$$\\mathrm{Label}(B_i \\mid h) = \\max_j \\left\\{ \\exp\\left[ -\\sum_{d=1}^{n} \\left( s_d (B_{ijd} - h_d) \\right)^2 \\right] \\right\\} \\qquad (1)$$\n\nwhere $s_d$ is a scale factor indicating the importance of feature $d$, $h_d$ is the feature value for dimension $d$, and $B_{ijd}$ is the feature value of instance $B_{ij}$ on dimension $d$. Let $\\mathrm{NLDD}(h, D) = \\sum_{i=1}^{m} \\left(-\\log \\Pr(\\ell_i \\mid h, B_i)\\right)$, where NLDD denotes the negative logarithm of DD. The DD algorithm [6] uses a two-step gradient descent search to find a value of $h$ that minimizes NLDD (and hence maximizes DD).\n\nRay and Page [10] developed a multiple-instance regression algorithm which can also handle real-value labeled data. They assumed an underlying linear model for the hypothesis and applied the algorithm to some artificial data. Similar to the current work, they also used EM to select one instance from each bag so that multiple regression can be applied to MI learning.\n\n3  Our algorithm: EM-DD\nWe now describe EM-DD and compare it with the original DD algorithm. One reason why MI learning is so difficult is the ambiguity caused by not knowing which instance is the important one. The basic idea behind EM-DD is to view the knowledge of which instance corresponds to the label of the bag as a missing attribute which can be estimated using the EM approach in a way similar to how EM is used in MI regression [10]. EM-DD starts with some initial guess of a target point $h$ obtained in the standard way by trying points from positive bags, then repeatedly performs the following two steps that combine EM with DD to search for the maximum likelihood hypothesis. In the first step (E-step), the current 
hypothesis $h$ is used to pick one instance from each bag which is most likely (given our generative model) to be the one responsible for the label given to the bag. In the second step (M-step), we use the two-step gradient ascent search (quasi-Newton search dfpmin in [8]) of the standard DD algorithm to find a new $h'$ that maximizes $DD(h)$. Once this maximization step is completed, we reset the proposed target $h$ to $h'$ and return to the first step until the algorithm converges. Pseudo-code for EM-DD is given in Figure 1.\n\nWe now briefly provide intuition as to why EM-DD improves both the accuracy and computation time of the DD algorithm. Again, the basic approach of DD is to use a gradient search to find a value of $h$ that maximizes $DD(h)$. In every search step, the DD algorithm uses all points in each bag and hence the maximum that occurs in Equation (1) must be computed. The prior diverse density algorithms [1,5,6,7] used a softmax approximation for the maximum (so that it will be differentiable), which dramatically increases the computational complexity and introduces additional error based on the parameter selected in softmax. In comparison, EM-DD converts the multiple-instance data to single-instance data by removing all but one point per bag in the E-step, which greatly simplifies the search step since the maximum that occurs in Equation (1) is removed in the E-step. The removal of softmax in EM-DD greatly decreases the computation time. In addition, we believe that EM-DD helps avoid getting caught in local minima since it makes major changes in the hypothesis when it switches which point is selected from a bag.\n\nWe now provide a sketch of the proof of convergence of EM-DD.  
Note that at each iteration $t$, given a set of instances selected in the E-step, the M-step will find a unique hypothesis ($h_t$) and corresponding DD ($dd_t$). At iteration $t+1$, if $dd_{t+1} \\le dd_t$, the algorithm will terminate. Otherwise, $dd_{t+1} > dd_t$, which means that a different set of instances was selected. For the iteration to continue, the DD must increase monotonically, and so the set of instances selected cannot repeat. Since there are only a finite number of sets of instances that can be selected at the E-step, the algorithm will terminate after a finite number of iterations.\n\nHowever, there is no guarantee on the convergence rate of EM algorithms. We found that $\\mathrm{NLDD}(h, D)$ usually decreases dramatically after the first several iterations and then begins to flatten out. From empirical tests we found that it is often beneficial to allow NLDD to increase slightly to escape a local minimum, and thus we used the less restrictive termination condition: $|dd_1 - dd_0| < 0.01 \\cdot dd_0$ or the number of iterations is greater than 10. This modification reduces the training time while achieving comparable results. However, for this modification no convergence proof can be given without restricting the number of iterations.\n\n4  Experimental results\nIn this section we summarize our experimental results. We begin by reporting our results for the two musk benchmark data sets provided by Dietterich et al. [4]. These data sets contain 166-dimensional feature vectors describing the surface for low-energy conformations of 92 molecules for Musk1 and 102 molecules for Musk2, where roughly half of the molecules are known to smell musky and the remainder are not.  
The Musk1 data set is smaller both in having fewer bags (i.e., molecules) and many fewer instances per bag (an average of 6.0 for Musk1 versus 64.7 for Musk2). Prior to this work, the highly-tuned iterated-discrim algorithm of Dietterich et al. still gave the best performance on both Musk1 and Musk2. Maron and Lozano-Perez [6] summarize the generally held belief that \"The performance reported for iterated-discrim APR involves choosing parameters to maximize the test set performance and so probably represents an upper bound for accuracy on this (Musk1) data set.\"\n\nMain(k, D)\n    partition D = {D_1, D_2, ..., D_10};    // 10-fold cross validation\n    for (i = 1; i <= 10; i++)\n        D_t = D - D_i;    // D_t training data, D_i validation data\n        pick k random positive bags B_1, ..., B_k from D_t;\n        let H_0 be the union of all instances from selected bags;\n        for every instance I_j in H_0\n            h_j = EM-DD(I_j, D_t);\n        e_i = min_{0 <= j <= ||H_0||} {error(h_j, D_i)};\n    return avg(e_1, e_2, ..., e_10);\n\nEM-DD(I, D_t)\n    let h = {h_1, ..., h_n, s_1, ..., s_n};    // initial hypothesis\n    for each dimension d = 1, ..., n\n        h_d = I_d;\n        s_d = 0.1;\n    nldd_0 = +infinity;\n    nldd_1 = NLDD(h, D_t);\n    while (nldd_1 < nldd_0)\n        for each bag B_i in D_t    // E-step\n            p_i* = argmax_{B_ij in B_i} Pr(B_ij in h);\n        h' = argmax_{h in H} prod_i Pr(l_i | h, p_i*);    // M-step\n        nldd_0 = nldd_1;\n        nldd_1 = NLDD(h', D_t);\n        h = h';\n    return h;\n\nFigure 1: Pseudo-code for EM-DD, where $k$ indicates the number of different starting bags used and $\\Pr(B_{ij} \\in h) = \\exp[-\\sum_{d=1}^{n} (s_d (B_{ijd} - h_d))^2]$. $\\Pr(\\ell_i \\mid h, p_i^*)$ is calculated as either $1 - |\\ell_i - \\Pr(p_i^* \\in h)|$ (linear model) or $\\exp[-(\\ell_i - \\Pr(p_i^* \\in h))^2]$ (Gaussian-like model), where $\\Pr(p_i^* \\in h) = \\max_{B_{ij} \\in B_i} \\Pr(B_{ij} \\in h)$.\n\nEM-DD without tuning outperforms all previous algorithms.  
To be consistent with the way in which past results have been reported for the musk benchmarks, we report the average accuracy of 10-fold cross-validation (which is the value returned by Main in Figure 1). EM-DD obtains an average accuracy of 96.8% on Musk1 and 96.0% on Musk2. A summary of the performance of different algorithms on the Musk1 and Musk2 data sets is given in Table 1. In addition, for both data sets, there are no false negative errors using EM-DD. This is important for the drug discovery application, since the final hypothesis would be used to filter potential drugs and a false negative error means that a potentially good drug molecule would not be tested. As compared to the standard DD algorithm, EM-DD only used three random bags for Musk1 and two random bags for Musk2 (versus all positive bags used in DD) as the starting point of the algorithm. Also, unlike the results reported in [6], in which the threshold is tuned based on leave-one-out cross validation, for our reported results the threshold value (of 0.5) is not tuned. More importantly, EM-DD runs over 10 times faster than DD on Musk1 and over 100 times faster when applied to Musk2.\n\nTable 1: Comparison of performance on Musk1 and Musk2 data sets as measured by giving the average accuracy across 10 runs using 10-fold cross validation. 
\n\nAlgorithm                           Musk1 accuracy   Musk2 accuracy\nEM-DD                               96.8%            96.0%\nIterated-discrim [4]                92.4%            89.2%\nCitation-kNN [12]                   92.4%            86.3%\nBayesian-kNN [12]                   90.2%            82.4%\nDiverse density [6]                 88.9%            82.5%\nMulti-instance neural network [9]   88.0%            82.0%\nMultinst [2]                        76.7%            84.0%\n\nIn addition to its superior performance on the musk data sets, EM-DD can handle real-value labeled data and produces real-value predictions. We present results using one real data set (Affinity)^1 that has real-value labels and several artificial data sets generated using the technique of our earlier work [1]. For these data sets, we used as our starting points the points from the bag with the highest DD value. The results are shown in Table 2. The Affinity data set has 283 features and 139 bags with an average of 32.5 points per bag. Only 29 bags have labels that were high enough to be considered as \"positive.\" Using the Gaussian-like version of our generative model we obtained a squared loss of 0.0185, and with the linear model we performed slightly better with a loss of 0.0164. In contrast, using the standard diverse density algorithm the loss was 0.0421. EM-DD also achieved much better performance than DD on two artificial data sets (160.166.1a-S and 80.166.1a-S) where both algorithms were used^2. The best result on the Affinity data was obtained using a version of citation-kNN [1] that works with real-value data, with a loss of 0.0124. We think that the Affinity data set is well-suited for a nearest neighbor approach in that all of the negative bags have labels between 0.34 and 0.42, and so the actual predictions for the negative bags are better with citation-kNN. 
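
To make the two models compared above concrete, the following Python sketch evaluates the bag prediction of Equation (1) and the linear and Gaussian-like estimates of Pr(l_i | h, B_i) given in the caption of Figure 1, and sums them into NLDD. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy bags and labels, and the small floor used to avoid log(0) are all hypothetical.

```python
import math
from typing import Callable, List, Sequence


def instance_prob(x: Sequence[float], h: Sequence[float], s: Sequence[float]) -> float:
    # Pr(B_ij in h) = exp(-sum_d (s_d * (B_ijd - h_d))^2), as in Figure 1.
    return math.exp(-sum((sd * (xd - hd)) ** 2 for xd, hd, sd in zip(x, h, s)))


def bag_label(bag: List[Sequence[float]], h: Sequence[float], s: Sequence[float]) -> float:
    # Equation (1): the predicted bag label is the maximum over its instances.
    return max(instance_prob(x, h, s) for x in bag)


def linear_likelihood(label: float, predicted: float) -> float:
    # Linear model: 1 - |l_i - Label(B_i | h)|.
    return 1.0 - abs(label - predicted)


def gaussian_likelihood(label: float, predicted: float) -> float:
    # Gaussian-like model: exp(-(l_i - Label(B_i | h))^2).
    return math.exp(-(label - predicted) ** 2)


def nldd(bags, labels, h, s, likelihood: Callable[[float, float], float]) -> float:
    # NLDD(h, D) = sum_i -log Pr(l_i | h, B_i).
    # The 1e-12 floor is an assumption of this sketch, used only to avoid log(0).
    total = 0.0
    for bag, label in zip(bags, labels):
        p = likelihood(label, bag_label(bag, h, s))
        total -= math.log(max(p, 1e-12))
    return total


# Hypothetical toy data: two bags with real-value labels in two dimensions.
bags = [[(0.0, 0.0), (1.0, 1.0)], [(2.0, 2.0), (3.0, 3.0)]]
labels = [1.0, 0.2]
h, s = (0.0, 0.0), (1.0, 1.0)

predictions = [bag_label(b, h, s) for b in bags]
squared_errors = [(l - p) ** 2 for l, p in zip(labels, predictions)]
nldd_linear = nldd(bags, labels, h, s, linear_likelihood)
nldd_gauss = nldd(bags, labels, h, s, gaussian_likelihood)
```

Under the Gaussian-like model each NLDD term is -log exp(-(l_i - p_i)^2) = (l_i - p_i)^2, so in this sketch minimizing NLDD is the same as minimizing the sum of squared errors; the linear model weights the same deviations differently.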
\n\nTo study the sensitivity of EM-DD to the number of relevant attributes and the size of the bags, tests were performed on artificial data sets with different numbers of relevant features and bag sizes. As shown in Table 2, similar to the DD algorithm [1], the performance of EM-DD degrades as the number of relevant features decreases. This behavior is expected since all scale factors are initialized to the same value, and when most of the features are relevant less adjustment is needed and hence the algorithm is more likely to succeed. In comparison to DD, EM-DD is more robust against changes in the number of relevant features. For example, as shown in Figure 2, when the number of relevant features is 160 out of 166, both EM-DD and DD algorithms perform well with good correlation between the actual labels and predicted labels. However, when the number of relevant features decreases to 80, almost no correlation between the actual and predicted labels is found using DD, while EM-DD can still provide good predictions on the labels.\n\nTable 2: Performance on data with real-value labels measured as squared loss.\n\nData set       # rel. features   # pts per bag   EM-DD   DD [1]\nAffinity       -                 32.5            .0164   .0421\n160.166.1a-S   160               4               .0014   .0052\n160.166.1b-S   160               15              .0013   -\n160.166.1c-S   160               25              .0012   -\n80.166.1a-S    80                4               .0029   .1116\n80.166.1b-S    80                15              .0023   -\n80.166.1c-S    80                25              .0022   -\n40.166.1a-S    40                4               .0038   -\n40.166.1b-S    40                15              .0026   -\n40.166.1c-S    40                25              .0037   -\n\nIntuitively, as the size of bags increases, more ambiguity is introduced to the data and the performance of algorithms is expected to go down. However, somewhat surprisingly, the performance of EM-DD actually improves as the number of examples per bag increases. We believe that this is partly due to the fact that with few points per bag the chance that a bad starting point has the highest diverse density is much higher than when the bags are large. In addition, in contrast to the standard diverse density algorithm, the overall time complexity of EM-DD does not go up as the size of the bags increases, since after the instance selection (E-step), the time complexity of the dominant M-step is essentially the same for data sets with different bag sizes. The fact that EM-DD scales up well to large bag sizes in both performance and running time is very important for real drug-discovery applications in which the bags can be quite large.\n\n^1 Jonathan Greene from CombiChem provided us with the Affinity data set. However, due to its proprietary nature we cannot make it publicly available.\n^2 See Amar et al. [1] for a description of these two data sets.\n\n5  Future directions\nThere are many avenues for future work. We believe that EM-DD can be refined to obtain better performance by finding alternate ways to select the initial hypothesis and scale factors. One option would be to use the result from a different learning algorithm as the starting point and then use EM-DD to refine the hypothesis. We are currently studying the application of the EM-DD algorithm to other domains such as content-based image retrieval.  
Since our algorithm is based on the diverse density likelihood measurement, we believe that it will perform well on all applications in which the standard diverse density algorithm has worked well. In addition, EM-DD and MI regression [10] presented a framework to convert the multiple-instance data to single-instance data, where supervised learning algorithms can be applied. We are currently working on using this general methodology to develop new MI learning techniques based on supervised learning algorithms and EM.\n\nFigure 2: Comparison of EM-DD and DD on real-value labeled artificial data with different numbers of relevant features. The x-axis corresponds to the actual label and the y-axis gives the predicted label. [Four scatter-plot panels: 160.166.1a-S (DD), 80.166.1a-S (DD), 160.166.1a-S (EM-DD), 80.166.1a-S (EM-DD).]\n\nAcknowledgments\nThe authors gratefully acknowledge the support of NSF grant CCR-9988314. We thank Dan Dooly for many useful discussions. We also thank Jonathan Greene, who provided us with the Affinity data set.\n\nReferences\n[1] Amar, R.A., Dooly, D.R., Goldman, S.A. & Zhang, Q. (2001). Multiple-Instance Learning of Real-Valued Data. Proceedings 18th International Conference on Machine Learning, pp. 3-10. San Francisco, CA: Morgan Kaufmann.\n\n[2] Auer, P. (1997). On learning from multi-instance examples: Empirical evaluation of a theoretical approach. Proceedings 14th International Conference on Machine Learning, pp. 21-29. San Francisco, CA: Morgan Kaufmann.\n\n[3] Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1): 1-38.\n\n[4] Dietterich, T.G., Lathrop, R.H., & Lozano-Perez, T. (1997). Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2): 31-71.\n\n[5] Maron, O. (1998). Learning from Ambiguity. Doctoral dissertation, MIT, AI Technical Report 1639.\n\n[6] Maron, O. & Lozano-Perez, T. (1998). A framework for multiple-instance learning. Neural Information Processing Systems 10. Cambridge, MA: MIT Press.\n\n[7] Maron, O. & Ratan, A. (1998). Multiple-instance learning for natural scene classification. Proceedings 15th International Conference on Machine Learning, pp. 341-349. San Francisco, CA: Morgan Kaufmann.\n\n[8] Press, W.H., Teukolsky, S.A., Vetterling, W.T., & Flannery, B.P. (1992). Numerical Recipes in C: the art of scientific computing. Cambridge University Press, New York, second edition.\n\n[9] Ramon, J. & De Raedt, L. (2000). Multi instance neural networks. Proceedings of ICML-2000 workshop on \"Attribute-Value and Relational Learning\".\n\n[10] Ray, S. & Page, D. (2001). Multiple-Instance Regression. Proceedings 18th International Conference on Machine Learning, pp. 425-432. San Francisco, CA: Morgan Kaufmann.\n\n[11] Ruffo, G. (2000). Learning single and multiple instance decision trees for computer security applications. Doctoral dissertation. Department of Computer Science, University of Turin, Torino, Italy.\n\n[12] Wang, J. & Zucker, J.-D. (2000). Solving the Multiple-Instance Learning Problem: A Lazy Learning Approach. Proceedings 17th International Conference on Machine Learning, pp. 1119-1125. San Francisco, CA: Morgan Kaufmann.\n", "award": [], "sourceid": 1959, "authors": [{"given_name": "Qi", "family_name": "Zhang", "institution": null}, {"given_name": "Sally", "family_name": "Goldman", "institution": null}]}